Main Questions
Our main questions:
- How have youth disconnection rates among American youth changed since 2008?
- In particular, how has this changed for different gender and ethnic groups? Are any groups particularly disconnected?
Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study please use:
Wright, Carrie, Ontiveros, Michael, Jager, Leah, Taub, Margaret, and Hicks, Stephanie. (2020). Disparities in Youth Disconnection (Version v1.0.0). https://github.com/opencasestudies/ocs-youth-disconnection-case-study
First, let’s discuss the meaning of the term “youth disconnection”.
According to Measure of America (a nonpartisan project of the nonprofit Social Science Research Council that is focused on opportunity in the United States), disconnected youth are:
“young people between the ages of 16 and 24 who are neither working nor in school”
The group states that such disconnection hinders these individuals from acquiring the skills and building the relationships necessary for a successful adulthood.
The group goes on to state that:
“people who experience a period of disconnection as young adults go on to earn less and are less likely to be employed, own a home, or report good health by the time they reach their thirties”
Disconnected youth are also referred to as opportunity youth, a term with the added positive connotation that supporting such individuals can benefit not only the individuals themselves, but also their communities and society.
Good news: According to this report, youth disconnection rates have generally been decreasing over the past 7 years.
Bad news: The same report shows racial and ethnic disparities, where some groups are showing increased rates of disconnection.
In this case study, we will expand beyond the Measure of America annual report to take a deeper look at differences in disconnection between different subgroups of youths. Identifying youths who are particularly at risk or disconnected can help inform the design of targeted prevention and re-engagement strategies. To do this, we use the following article as our motivation for this case study.
Mendelson, T., Mmari, K., Blum, R. W., Catalano, R. F. & Brindis, C. D. Opportunity Youth: Insights and Opportunities for a Public Health Approach to Reengage Disconnected Teenagers and Young Adults. Public Health Rep 133, 54S-64S (2018).
The article describes strategies for prevention of disconnection and re-engagement of disconnected youth and how such interventions could greatly positively impact opportunity youth for the entire trajectory of their lives and for future generations. It also points out that indeed there are disparities among different racial/ethnic groups.
In this case study, we will demonstrate how to import and wrangle data available in a Portable Document Format (PDF). We will especially focus on using packages and functions from the tidyverse, such as dplyr and ggplot2. The tidyverse is a collection of packages created by RStudio. While some students may be familiar with older R packages for these tasks, the tidyverse packages make data science in R more legible and intuitive.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data science Learning Objectives:
- Importing images and extracting text from images (magick package)
- Data wrangling (dplyr)
- Reshaping data into wide and long formats (tidyr)
- Dropping rows with NA values and separating columns (tidyr)
- Creating plots with ggplot2 that are in a similar style to an existing image
- Combining images and plots (cowplot and ggplot2)

Statistical Learning Objectives:
- Implementing the Mann-Kendall trend test (Kendall)
We will begin by loading the packages that we will need:
library(here)
library(pdftools)
library(tesseract)
library(magick)
library(knitr)
library(dplyr)
library(stringr)
library(magrittr)
library(tidyr)
library(tibble)
library(ggplot2)
library(directlabels)
library(cowplot)
library(forcats)
library(Kendall)
library(patchwork)

| Package | Use in this case study |
|---|---|
| here | to easily load and save data |
| pdftools | to import PDF documents |
| magick | for importing images and extracting text from images |
| tesseract | for extracting text from images with magick |
| knitr | for showing images in reports |
| dplyr | to filter, subset, join, add rows to, and modify the data |
| stringr | to manipulate strings |
| magrittr | to pipe sequential commands |
| tidyr | to change the shape or format of tibbles to wide and long, to drop rows with NA values, to separate a column into additional columns, and to fill out values based on previous values |
| tibble | to create tibbles |
| ggplot2 | to create plots |
| directlabels | to add labels directly to lines in plots |
| cowplot | to add images to plots |
| forcats | to reorder factor for plot |
| Kendall | to implement the Mann-Kendall trend test in R |
| patchwork | to combine plots |
The first time we use a function, we will use the :: operator to indicate which package it comes from. Unless function names overlap across packages, this is not necessary, but we include it to be informative about where the functions we use come from.
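For example, both of the following calls run the same function once stringr is loaded; the :: prefix simply documents the source package (a minimal illustration, not from the original case study):

```r
# Both lines call the same function; the prefix documents its package
stringr::str_to_title("native american")  # "Native American"
str_to_title("native american")           # same result, since stringr is loaded
```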
So how does youth disconnection happen and what impact does it have?
There are many known risk factors, which have been identified in a variety of contexts (from family, friends, school, community, society) including:
These risk factors make it more likely for young people to miss out on education, training, and networking that can act as a foundation for a successful career.
There are also many known negative consequences associated with youth disconnection including but not limited to:
Photo by Jon Tyson on Unsplash
Furthermore, in 2012 it was estimated that each disconnected youth costs taxpayers $250,000 during a lifetime due to lost tax revenue and costs for social services, health care, and criminal justice.
Youth disconnection can be described as a continuum: some youths are disconnected for only a brief time, while others are chronically disconnected. Additionally, an individual who is out of school and work and also lacks supportive relationships may be further along the continuum than an individual who has social support.
Here is an illustration of risk factors, protective factors and the continuum of disconnection:
Many programs have identified useful strategies in re-engaging disconnected youth or preventing disconnection of youth.
Generally speaking, most programs focus on re-engagement strategies; however, prevention strategies are likely to be just as important.
Research suggests that active involvement with at-risk youth from infancy and across multiple developmental stages through young adulthood would be the most beneficial.
In fact, the quality of parental caregiving of infants aged 6-24 months has been shown to be a predictor of high school dropout rates! Thus early interventions may be very important, and consistent continual engagement may prevent further disconnection of youths.
Prevention strategies include:
Want to learn more about how to prevent and mitigate youth disconnection?
Or do you know youths who are disconnected?
See the program directory at youth.gov and this program listing focused on Maryland but including other locations for listings of programs dedicated to re-engaging disconnected youth or preventing disconnection.
Also, see The Center for Communities That Care and the PROSPER program for particular examples.
The statistics used in this section came from this article.
There are some important considerations regarding this data analysis to keep in mind:
The data used in the Measure of America project is derived from the American Community Survey (ACS), which excludes or under-represents certain opportunity youth groups, such as youths in the juvenile justice system, youths in the foster care system, and homeless youths, because the survey is conducted on households. Furthermore, youths who may be more disconnected for reasons beyond not being in work or school, such as dealing with the added challenge of being a teenage mother or being abused, are not identifiable in this dataset. Thus, this data likely underestimates youth disconnection rates. Finally, the data used in these reports are collected from human participants; this presents the potential for response bias, as participants in the sampling frame may, for a variety of reasons, report inaccurate information.
Data about certain group intersections (meaning, for example, individuals of a particular gender and ethnicity) or particular groups in general, such as specific ethnicities or gender or sexual identity groups like LGBTQIA+ (lesbian/gay/bisexual/pansexual/transgender/genderqueer/queer,questioning/intersexed/agender/asexual), is unfortunately not available in the data used in this analysis, nor in most research on this topic. Fortunately, however, recent years of the ACS have more detailed information about a greater number of racial and ethnic groups and racial/ethnic intersections.
The statistical procedures we are using may be overly simplistic. In all data analysis, we need to be wary about deriving meaning from the statistical procedures we use. Furthermore, the data we are using is summary level showing averages for large groups. Thus, it is difficult to know how reliable the summary estimates are, as we don’t have information about the variability of the data within a group.
Using image processing tools can be very helpful. The manner in which data is obtained with image processing tools is what we would describe as a black box process, a process with known inputs and outputs but unknown mechanics. Because we are unaware of how our outputs are generated from our inputs, we need to be wary of the output. With the small output we are creating in this case study, a visual inspection should suffice.
In this case study we will be using data related to youth disconnection from the two following reports from the Measure of America project:
Measure of America is a nonpartisan project of the nonprofit Social Science Research Council founded in 2007 to create easy-to-use yet methodologically sound tools for understanding well-being and opportunity in America. Through reports, interactive apps, and custom-built dashboards, Measure of America works with partners to breathe life into numbers, using data to identify areas of highest need, pinpoint levers for change, and track progress over time.
The data used in these reports comes from the American Community Survey (ACS), which is the largest survey conducted by the United States Census Bureau. The survey started in 2005 and collects data from 3.5 million households annually. Data is collected about ancestry, citizenship, income, employment, and disability, among many other aspects. See here for more detailed information about the survey.
According to Wikipedia:
“Data is collected by internet, mail, telephone interviews and in-person interviews… About 95 percent of households across all response modes ultimately respond… ACS responses are confidential”
It is a mandatory survey, governed by federal laws that can impose a fine of up to $5,000 for refusing to participate.
We are particularly interested in the following tables on the last page of the Measure of America 2019 report:
Also, we are particularly interested in the tables on the following pages from the Measure of America 2020 report:
pdftools

One way to import data from a PDF is to use the pdf_text() function of the pdftools package. The here() function in the here package allows us to specify the path or location of the document that we want to import, starting from the directory where a .Rproj file is located. In this case, we will import Making_the_Connection.pdf in the docs directory. (Note this is only the case if you pull the repository from GitHub.)
We can take a look at the output for the page with our tables of interest by simply using brackets [] around the page number. The page we are interested in (although labeled 39 in the report) is the 44th page of the PDF, which looks like this:
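The import and page-selection steps described above might look like the following sketch (the object name report_text is ours, not from the original code):

```r
# Import every page of the PDF as one character string per page
report_text <- pdftools::pdf_text(here::here("docs", "Making_the_Connection.pdf"))

# The table labeled page 39 in the report is the 44th page of the PDF
report_text[44]
```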
[1] "Youth Disconnection by Gender and by Race and Ethnicity\n I NDI CATOR TA BLE S\n DISCONNECTED YOUTH\nMAJOR RACIAL AND RATE (% ages 16–24) 2017 CHANGE IN RATE\nETHNIC GROUPS 2008 2010 2012 2014 2016 (%) (#) 2010–2017 (%)\nUnited States 12.6 14.7 14.1 13.2 11.7 11.5 4,501,800 -22.1\nMale 12.3 15.2 14.5 13.3 12.1 11.8 2,382,500 -22.5\nFemale 12.9 14.1 13.7 13.0 11.2 11.1 2,119,400 -21.7\nASIAN 7.1 8.5 7.8 7.9 6.6 6.6 145,600 -21.7\nAsian Male 6.3 8.3 7.4 7.2 6.7 6.5 73,000 -21.4\nAsian Female 7.9 8.6 8.1 8.6 6.6 6.7 72,600 -22.0\nWHITE 9.7 11.7 11.2 10.8 9.7 9.4 1,961,700 -20.1\nWhite Male 9.5 12.3 11.5 10.8 10.0 9.6 1,031,200 -22.4\nWhite Female 10.0 11.1 10.8 10.7 9.4 9.1 930,600 -17.4\nLATINO 16.7 18.5 17.3 15.2 13.7 13.2 1,157,300 -28.7\nLatino Male 13.6 16.8 16.0 14.0 12.6 12.4 562,600 -26.0\nLatina Female 20.2 20.3 18.8 16.5 14.8 13.9 594,700 -31.5\nBLACK 20.4 22.5 22.4 20.6 17.2 17.9 999,700 -20.6\nBlack Male 23.7 26.0 25.6 23.5 20.1 20.8 591,600 -19.8\nBlack Female 17.0 19.0 19.3 17.6 14.2 14.8 408,000 -22.1\nNATIVE AMERICAN 24.4 28.8 27.0 26.3 25.8 23.9 67,700 -17.1\nNative American Male 25.0 30.9 28.0 26.9 28.1 23.3 33,200 -24.5\nNative American Female 23.9 26.7 25.9 25.6 23.4 24.5 34,500 -8.4\n 2017 2017\n ASIAN SUBGROUPS (%) (#) LATINO SUBGROUPS (%) (#)\n United States 11.5 4,501,800\n Male 11.8 2,382,500 LATINO 13.2 1,157,300\n Female 11.1 2,119,400 Latino Male 12.4 562,600\n ASIAN 6.6 145,600 Latina Female 13.9 594,700\n Asian Male 6.5 73,000 SOUTH AMERICAN 8.4 37,600\n Asian Female 6.7 72,600 South American Male 9.1 20,400\n CHINESE 4.3 23,800 South American Female 7.7 17,200\n Chinese Male 4.7 12,700 CENTRAL AMERICAN 12.0 93,100\n Chinese Female 3.9 11,100 Central American Male 9.3 37,900\n VIETNAMESE 5.5 13,500 Central American Female 15.0 55,200\n Vietnamese Male 7.5 9,300\n MEXICAN 13.3 762,400\n Vietnamese Female 3.4 4,200\n Mexican Male 12.2 358,200\nINDIAN 5.9 22,300\n Mexican Female 14.4 404,200\nIndian Male 4.1 8,000 OTHER LATINO 13.6 
44,800\nIndian Female 7.8 14,300\n Other Latino Male 15.3 27,600\n PAKISTANI 6.4 4,900\n Other Latina Female 11.5 17,300\n Pakistani Male\n PUERTO RICAN, DOMINICAN, CUBAN 15.1 211,200\n Pakistani Female\n PR, DR, Cuban Female 15.7 114,500\n KOREAN 6.5 11,200\n PR, DR, Cuban Female 14.4 96,600\n Korean Male 8.0 6,900\n Korean Female 5.0 4,200\n NOTE: Blank cells indicate the estimate is unreliable\n TWO OR MORE 6.6 4,000\n Two or More Male\n Two or More Female\nFILIPINO 7.3 23,400\nFilipino Male 6.5 10,800\nFilipino Female 8.1 12,700\nHMONG 14.0 8,300\nHmong Male 18.6 5,700\nHmong Female\nMAKING THE CONNECTION | Transportation and Youth Disconnection 39\n"
From the output, it’s clear that a relatively large amount of manipulation will be required to wrangle this data. If you are interested in learning more about the pdftools package, please see this case study and this case study.
Question Opportunity
If you are familiar with pdftools, spend 3-5 minutes trying to extract the multiple tables from page 44.
While extracting these tables with the pdftools package is not impossible, you might find it a bit challenging because of how the multiple tables are displayed on this page.
While the output would be reproducible, the process may be too time consuming.
Let’s consider alternative approaches to importing these data.
Next, we will demonstrate how to produce reproducible tables with image-processing software in R using a package called magick, which allows for the extraction of text from images. The advantage of this option is that we can take a screenshot of just the piece of the page that we want to wrangle.
magick

We will now import the data using the magick package, which allows for the importation of data from images.
First, we will take a screenshot (specifically create a .png file) of the top part of the gender, race, and ethnicity table on the last page of the 2019 Measure of America Report. We are only interested in the percentage of disconnection across the years, so we don’t need our screenshot to include the last couple of columns.
Here, we show what this file looks like in this rendered rmarkdown website by using the include_graphics() function of the knitr package.
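The display step might look like this (assuming the screenshot was saved to the img directory, the same path used when we read the file below):

```r
# Show the saved screenshot in the rendered report
knitr::include_graphics(here::here("img", "Major_ethnic_groups_screenshot.png"))
```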
Next, we use the image_read() function of the magick package to import this image.
Then, we use the image_info() function to make sure that the import worked and to get information about the size, format and color of the image.
Major_racial_ethnic_groups <-
magick::image_read(here::here("img", "Major_ethnic_groups_screenshot.png"))
magick::image_info(Major_racial_ethnic_groups)

# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 1086 867 sRGB TRUE 368488 72x72
We get some basic information, such as the file format being PNG, the width and height of the image, etc. Let’s take a closer look at our image in R! Now that we have imported the image, to see it we simply need to type its name.
Nice!
Let’s import a couple more images just for fun. Here we will import an image directly from a URL.
ggplot2_url <- "https://d33wubrfki0l68.cloudfront.net/2c6239d311be6d037c251c71c3902792f8c4ddd2/12f67/css/images/hex/ggplot2.png"
ggplot2_logo <- image_read(ggplot2_url)
ggplot2_logo

Now we will use the image_ocr() function of the magick package to extract the text from the ggplot2 logo image. This function uses the tesseract package, which has tools for optical character recognition (OCR), hence the ocr in the function name. This allows the function to identify text in images. OCR tools have often been developed using machine learning, in which an algorithm is trained on images with and without text to “learn” to recognize text. See here to learn more about how OCR works.
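The extraction step for this logo can be written as:

```r
# Extract any text tesseract can recognize in the image
magick::image_ocr(ggplot2_logo)
```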
[1] "ggplot2\n"
Awesome! We were able to extract text from this hex sticker!
One thing to keep in mind is that this doesn’t always work. Unusual fonts, angled text, or particular colors can be difficult for the OCR to recognize.
Here is an example that does not work with the current version of magick:
tidyverse_url <- "https://tidyverse.tidyverse.org/logo.png"
tidyverse_logo <- image_read(tidyverse_url)
tidyverse_logo
image_ocr(tidyverse_logo)

[1] ""
This is likely due to the background of this particular hex sticker. So sometimes this process requires a bit of trial and error.
Now let’s try extracting the text from our image files.
The first image we imported looks like this.
Now we will extract the text!
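A sketch of this step, storing the result for the wrangling that follows (the assignment mirrors how major_groups is used later in this case study):

```r
# Run OCR on the screenshot of the table and keep the raw text
major_groups <- magick::image_ocr(Major_racial_ethnic_groups)
major_groups
```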
[1] "United States 12.6 14.7 14.1 13.2 11.7 11.5\nMale 12.3 15.2 14.5 13.3 12.1 11.8\nFemale 12.9 14.1 13.7 13.0 11.2 11.1\nASIAN 7.1 8.5 78 79 6.6 6.6\nAsian Male 6.3 8.3 74 7.2 6.7 6.5\nAsian Female 7.9 8.6 8.1 8.6 6.6 6.7\nWHITE 9.7 11.7 11.2 10.8 9.7 9.4\nWhite Male 9.5 12.3 11.5 10.8 10.0 9.6\nWhite Female 10.0 11.1 10.8 10.7 9.4 9.1\nLATINO 16.7 18.5 17.3 15.2 13.7 13.2\nLatino Male 13.6 16.8 16.0 14.0 12.6 12.4\nLatina Female 20.2 20.3 18.8 16.5 14.8 13.9\nBLACK 20.4 22.5 22.4 20.6 17.2 17.9\nBlack Male 23.7 26.0 25.6 23.5 20.1 20.8\nBlack Female 17.0 19.0 19.3 17.6 14.2 14.8\nNATIVE AMERICAN 24.4 28.8 27.0 26.3 25.8 23.9\nNative American Male 25.0 30.9 28.0 26.9 28.1 23.3\nNative American Female 23.9 26.7 25.9 25.6 23.4 24.5\n"
This looks like it worked fairly well!
We appear to have lost the column names (likely because of the different background color) but the values look pretty good.
You may notice that there are lots of \n values in the text from our image. These are newline characters, which denote the end of a line of text and the start of a new line of text.
To deal with these, we use the str_split() function of the stringr package to split the text based on the \n characters; the output turns out to be a list. Then, we can “unlist” the output using the base R unlist() function. By base R, we mean that the function is loaded automatically in an R session. Finally, we use the as_tibble() function of the tibble package to convert the data into tibble format, which is the tidyverse version of a data frame. This will allow us to see the values in the table much better.
To do all of these sequential steps efficiently we will use a method called piping.
Click here if you are unfamiliar with piping in R, which uses the %>% operator.

The %>% pipe operator is accessible after loading the tidyverse, or several of the packages within the tidyverse like dplyr, because they load the magrittr package. This allows us to perform multiple sequential steps on one data input.
major_groups <-
major_groups %>%
stringr::str_split(pattern ="\n") %>%
unlist() %>%
tibble::as_tibble()

# A tibble: 19 x 1
value
<chr>
1 "United States 12.6 14.7 14.1 13.2 11.7 11.5"
2 "Male 12.3 15.2 14.5 13.3 12.1 11.8"
3 "Female 12.9 14.1 13.7 13.0 11.2 11.1"
4 "ASIAN 7.1 8.5 78 79 6.6 6.6"
5 "Asian Male 6.3 8.3 74 7.2 6.7 6.5"
6 "Asian Female 7.9 8.6 8.1 8.6 6.6 6.7"
7 "WHITE 9.7 11.7 11.2 10.8 9.7 9.4"
8 "White Male 9.5 12.3 11.5 10.8 10.0 9.6"
9 "White Female 10.0 11.1 10.8 10.7 9.4 9.1"
10 "LATINO 16.7 18.5 17.3 15.2 13.7 13.2"
11 "Latino Male 13.6 16.8 16.0 14.0 12.6 12.4"
12 "Latina Female 20.2 20.3 18.8 16.5 14.8 13.9"
13 "BLACK 20.4 22.5 22.4 20.6 17.2 17.9"
14 "Black Male 23.7 26.0 25.6 23.5 20.1 20.8"
15 "Black Female 17.0 19.0 19.3 17.6 14.2 14.8"
16 "NATIVE AMERICAN 24.4 28.8 27.0 26.3 25.8 23.9"
17 "Native American Male 25.0 30.9 28.0 26.9 28.1 23.3"
18 "Native American Female 23.9 26.7 25.9 25.6 23.4 24.5"
19 ""
OK, this looks pretty good!
The only issue is that some values appear to be missing a decimal point.
No worries though, we can modify the entire table in a reproducible way to get those decimal places back. However, we need to do some other wrangling steps first.
First, let’s separate the first column about ethnicities from the values in the subsequent columns. We can do so using the separate() function of the tidyr package, based on regular expressions. Regular expressions (abbreviated regex) are notation shortcuts that describe patterns in character strings. See here for an RStudio cheatsheet about them.
We want to separate by instances where a letter is followed by a space and then a number.
We can specify any letter by using the regex [:alpha:] notation and any digit by using the regex [:digit:] notation. We could instead have listed every letter that we saw the first column ending with, like so: s|e|E|O|K|N, but this would not be as reproducible (it might not work next year if a new group were added that ended in a different letter), and we might make a mistake. This is why regular expressions are so useful.
We can indicate that we want a space by using the regex \\s.
Now to specify that we want to see a letter first followed by a space, followed by a digit, we need to use a look around:
We will use the “preceded by” and “followed by” look arounds. Thus (?<=[:alpha:]) stands for any letter that appears before a space \\s that is followed by any digit (?=[0-9]). Altogether the pattern we want to separate by looks like this: "(?<=[:alpha:])\\s(?=[0-9])".
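We can try this pattern on a single row of our text to see where it splits (a quick check, not part of the original analysis):

```r
# Splits only at the space between a letter and a digit
stringr::str_split("Asian 7.1 8.5", "(?<=[:alpha:])\\s(?=[0-9])")
#> [[1]]
#> [1] "Asian"   "7.1 8.5"
```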
Now to separate the value column into two columns, we can use the separate() function of the tidyr package to do this. This will allow us to not only split the rows by our regex expression, but also to create column names.
There are three important arguments for the separate() function:
- col - this specifies what column you are separating
- into - this specifies the names of the new columns you are creating
- sep - this specifies what character string to look for to separate by
Thus we will separate the value column into Group and years columns.
major_groups <-
major_groups %>%
tidyr::separate(col = value,
into = c("Group", "Years"),
sep = "(?<=[:alpha:])\\s(?=[0-9])")
major_groups

# A tibble: 19 x 2
Group Years
<chr> <chr>
1 "United States" 12.6 14.7 14.1 13.2 11.7 11.5
2 "Male" 12.3 15.2 14.5 13.3 12.1 11.8
3 "Female" 12.9 14.1 13.7 13.0 11.2 11.1
4 "ASIAN" 7.1 8.5 78 79 6.6 6.6
5 "Asian Male" 6.3 8.3 74 7.2 6.7 6.5
6 "Asian Female" 7.9 8.6 8.1 8.6 6.6 6.7
7 "WHITE" 9.7 11.7 11.2 10.8 9.7 9.4
8 "White Male" 9.5 12.3 11.5 10.8 10.0 9.6
9 "White Female" 10.0 11.1 10.8 10.7 9.4 9.1
10 "LATINO" 16.7 18.5 17.3 15.2 13.7 13.2
11 "Latino Male" 13.6 16.8 16.0 14.0 12.6 12.4
12 "Latina Female" 20.2 20.3 18.8 16.5 14.8 13.9
13 "BLACK" 20.4 22.5 22.4 20.6 17.2 17.9
14 "Black Male" 23.7 26.0 25.6 23.5 20.1 20.8
15 "Black Female" 17.0 19.0 19.3 17.6 14.2 14.8
16 "NATIVE AMERICAN" 24.4 28.8 27.0 26.3 25.8 23.9
17 "Native American Male" 25.0 30.9 28.0 26.9 28.1 23.3
18 "Native American Female" 23.9 26.7 25.9 25.6 23.4 24.5
19 "" <NA>
Looks good!
Let’s also get rid of the all caps for the major categories of the Group column. We can convert the words to only capitalize the first letter using the str_to_title() function of the stringr package. To specifically modify the Group column we can use the mutate function of the dplyr package.
We are also going to use a special pipe operator (%<>%) from the magrittr package called the compound assignment pipe-operator or sometimes the double pipe operator.
This allows us to use the major_groups as our input and reassign it at the end after all the subsequent steps have been performed, although in this case it is only one step.
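Putting those pieces together, the step described above can be written as:

```r
# Convert ALL-CAPS group names to title case and reassign with %<>%
major_groups %<>%
  dplyr::mutate(Group = stringr::str_to_title(Group))
major_groups
```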
# A tibble: 19 x 2
Group Years
<chr> <chr>
1 "United States" 12.6 14.7 14.1 13.2 11.7 11.5
2 "Male" 12.3 15.2 14.5 13.3 12.1 11.8
3 "Female" 12.9 14.1 13.7 13.0 11.2 11.1
4 "Asian" 7.1 8.5 78 79 6.6 6.6
5 "Asian Male" 6.3 8.3 74 7.2 6.7 6.5
6 "Asian Female" 7.9 8.6 8.1 8.6 6.6 6.7
7 "White" 9.7 11.7 11.2 10.8 9.7 9.4
8 "White Male" 9.5 12.3 11.5 10.8 10.0 9.6
9 "White Female" 10.0 11.1 10.8 10.7 9.4 9.1
10 "Latino" 16.7 18.5 17.3 15.2 13.7 13.2
11 "Latino Male" 13.6 16.8 16.0 14.0 12.6 12.4
12 "Latina Female" 20.2 20.3 18.8 16.5 14.8 13.9
13 "Black" 20.4 22.5 22.4 20.6 17.2 17.9
14 "Black Male" 23.7 26.0 25.6 23.5 20.1 20.8
15 "Black Female" 17.0 19.0 19.3 17.6 14.2 14.8
16 "Native American" 24.4 28.8 27.0 26.3 25.8 23.9
17 "Native American Male" 25.0 30.9 28.0 26.9 28.1 23.3
18 "Native American Female" 23.9 26.7 25.9 25.6 23.4 24.5
19 "" <NA>
Nice! That looks better.
For the year data we would like to try splitting the strings for each row into different columns based on a space. Currently all the data is listed in one column called Years.
We can use the separate() function of the tidyr package again to do this. This will allow us to split the rows by spaces, as well as provide names for the new columns.
major_groups %<>%
tidyr::separate(col = Years,
into = c("2008", "2010", "2012", "2014", "2016", "2017"),
sep = " ")
major_groups

# A tibble: 19 x 7
Group `2008` `2010` `2012` `2014` `2016` `2017`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "United States" 12.6 14.7 14.1 13.2 11.7 11.5
2 "Male" 12.3 15.2 14.5 13.3 12.1 11.8
3 "Female" 12.9 14.1 13.7 13.0 11.2 11.1
4 "Asian" 7.1 8.5 78 79 6.6 6.6
5 "Asian Male" 6.3 8.3 74 7.2 6.7 6.5
6 "Asian Female" 7.9 8.6 8.1 8.6 6.6 6.7
7 "White" 9.7 11.7 11.2 10.8 9.7 9.4
8 "White Male" 9.5 12.3 11.5 10.8 10.0 9.6
9 "White Female" 10.0 11.1 10.8 10.7 9.4 9.1
10 "Latino" 16.7 18.5 17.3 15.2 13.7 13.2
11 "Latino Male" 13.6 16.8 16.0 14.0 12.6 12.4
12 "Latina Female" 20.2 20.3 18.8 16.5 14.8 13.9
13 "Black" 20.4 22.5 22.4 20.6 17.2 17.9
14 "Black Male" 23.7 26.0 25.6 23.5 20.1 20.8
15 "Black Female" 17.0 19.0 19.3 17.6 14.2 14.8
16 "Native American" 24.4 28.8 27.0 26.3 25.8 23.9
17 "Native American Male" 25.0 30.9 28.0 26.9 28.1 23.3
18 "Native American Female" 23.9 26.7 25.9 25.6 23.4 24.5
19 "" <NA> <NA> <NA> <NA> <NA> <NA>
Looks pretty good!
We appear to have an empty row at the very end. Since all of its year values are NA, we can use the drop_na() function of the tidyr package to remove it.
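This step can be written as:

```r
# Remove the empty final row, whose year values are all NA
major_groups %<>%
  tidyr::drop_na()
major_groups
```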
# A tibble: 18 x 7
Group `2008` `2010` `2012` `2014` `2016` `2017`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 United States 12.6 14.7 14.1 13.2 11.7 11.5
2 Male 12.3 15.2 14.5 13.3 12.1 11.8
3 Female 12.9 14.1 13.7 13.0 11.2 11.1
4 Asian 7.1 8.5 78 79 6.6 6.6
5 Asian Male 6.3 8.3 74 7.2 6.7 6.5
6 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7
7 White 9.7 11.7 11.2 10.8 9.7 9.4
8 White Male 9.5 12.3 11.5 10.8 10.0 9.6
9 White Female 10.0 11.1 10.8 10.7 9.4 9.1
10 Latino 16.7 18.5 17.3 15.2 13.7 13.2
11 Latino Male 13.6 16.8 16.0 14.0 12.6 12.4
12 Latina Female 20.2 20.3 18.8 16.5 14.8 13.9
13 Black 20.4 22.5 22.4 20.6 17.2 17.9
14 Black Male 23.7 26.0 25.6 23.5 20.1 20.8
15 Black Female 17.0 19.0 19.3 17.6 14.2 14.8
16 Native American 24.4 28.8 27.0 26.3 25.8 23.9
17 Native American Male 25.0 30.9 28.0 26.9 28.1 23.3
18 Native American Female 23.9 26.7 25.9 25.6 23.4 24.5
Great, now we have 18 rows.
It’s important to look very carefully at the text. Again, there are some values missing a decimal place. For example the row where the Group value is Asian, the third and fourth values are missing a decimal place.
Looking at the original table we see that even values like 10 are represented as 10.0.
We can also see that the number values are of class character based on the <chr> at the top of each column.
Click here for an explanation about data types in R and about character strings.
There are several classes of data in R programming. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter "a" or the number "3". If data are of class character, then the numeric values will not be processed like numeric values in a mathematical sense. If you want numeric values to be interpreted that way, they need to be converted to a numeric class. The options typically used are integer (which has no decimal place) and double precision (which has a decimal place).
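A quick illustration of the difference (not part of the analysis itself):

```r
class("12.6")             # "character" -- no arithmetic possible yet
class(12.6)               # "numeric"
as.numeric("12.6") * 10   # 126 -- works once converted to numeric
```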
To fix this, we will first remove all decimal points, taking advantage of the fact that these values are character strings (this is also sort of like multiplying every value that does have a decimal by 10). Next we will convert the values to numeric, and finally we will multiply all values by 0.1 to add the decimals back.
Here, we use the mutate() function combined with the across() function, which allows us to specify the columns we want to perform a function on. We want to do this for all the year columns, so we can exclude the Group column by using a minus sign operator - in the .cols argument of the across() function like so: mutate(across(.cols = -Group))
Finally, we will use the str_remove() function of the stringr package to find instances of “.” and remove them. Since “.” is a regex metacharacter that matches any character, we need to escape it as “\\.” to have R interpret it as a literal period, as we can see from the RStudio cheat sheets:
To pass the data from all the columns except our Group variable into the str_remove() function, we use the . notation as a placeholder for the data specified by the .cols argument, and we put ~ in front of the function call.
major_groups %<>%
mutate(across(.cols = -Group,
~str_remove(string = ., pattern = "\\.")))
major_groups

# A tibble: 18 x 7
Group `2008` `2010` `2012` `2014` `2016` `2017`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 United States 126 147 141 132 117 115
2 Male 123 152 145 133 121 118
3 Female 129 141 137 130 112 111
4 Asian 71 85 78 79 66 66
5 Asian Male 63 83 74 72 67 65
6 Asian Female 79 86 81 86 66 67
7 White 97 117 112 108 97 94
8 White Male 95 123 115 108 100 96
9 White Female 100 111 108 107 94 91
10 Latino 167 185 173 152 137 132
11 Latino Male 136 168 160 140 126 124
12 Latina Female 202 203 188 165 148 139
13 Black 204 225 224 206 172 179
14 Black Male 237 260 256 235 201 208
15 Black Female 170 190 193 176 142 148
16 Native American 244 288 270 263 258 239
17 Native American Male 250 309 280 269 281 233
18 Native American Female 239 267 259 256 234 245
Great, now in order to multiply each value by 0.1 we need to first make the values numeric. Currently we can tell that they are character strings based on the <chr> listed under each column name.
To convert our values to be numeric we can use the base as.numeric() function. Again we will use mutate() and across(). Since this function doesn't require any additional arguments, we don't need to specify its input like we just did for str_remove(), but we could do so as shown below.
major_groups %<>%
mutate(across(.cols = -Group, as.numeric))
#this is equivalent:
#major_groups %<>%
# mutate(across(.cols = -Group, ~as.numeric(.)))
major_groups
# A tibble: 18 x 7
Group `2008` `2010` `2012` `2014` `2016` `2017`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 United States 126 147 141 132 117 115
2 Male 123 152 145 133 121 118
3 Female 129 141 137 130 112 111
4 Asian 71 85 78 79 66 66
5 Asian Male 63 83 74 72 67 65
6 Asian Female 79 86 81 86 66 67
7 White 97 117 112 108 97 94
8 White Male 95 123 115 108 100 96
9 White Female 100 111 108 107 94 91
10 Latino 167 185 173 152 137 132
11 Latino Male 136 168 160 140 126 124
12 Latina Female 202 203 188 165 148 139
13 Black 204 225 224 206 172 179
14 Black Male 237 260 256 235 201 208
15 Black Female 170 190 193 176 142 148
16 Native American 244 288 270 263 258 239
17 Native American Male 250 309 280 269 281 233
18 Native American Female 239 267 259 256 234 245
Great, we can see that the year variables are now numeric as they are now type double as indicated by the <dbl> below each column name. See the above section about data types if you are unfamiliar with type double.
OK, now we can multiply each value by 0.1 to add our decimal points back and get back to the original values.
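This uses the same mutate() and across() pattern as before:

```r
major_groups %<>%
  mutate(across(.cols = -Group, ~ . * 0.1))
```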
# A tibble: 18 x 7
Group `2008` `2010` `2012` `2014` `2016` `2017`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 United States 12.6 14.7 14.1 13.2 11.7 11.5
2 Male 12.3 15.2 14.5 13.3 12.1 11.8
3 Female 12.9 14.1 13.7 13 11.2 11.1
4 Asian 7.1 8.5 7.8 7.9 6.6 6.6
5 Asian Male 6.3 8.3 7.4 7.2 6.7 6.5
6 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7
7 White 9.7 11.7 11.2 10.8 9.7 9.4
8 White Male 9.5 12.3 11.5 10.8 10 9.6
9 White Female 10 11.1 10.8 10.7 9.4 9.1
10 Latino 16.7 18.5 17.3 15.2 13.7 13.2
11 Latino Male 13.6 16.8 16 14 12.6 12.4
12 Latina Female 20.2 20.3 18.8 16.5 14.8 13.9
13 Black 20.4 22.5 22.4 20.6 17.2 17.9
14 Black Male 23.7 26 25.6 23.5 20.1 20.8
15 Black Female 17 19 19.3 17.6 14.2 14.8
16 Native American 24.4 28.8 27 26.3 25.8 23.9
17 Native American Male 25 30.9 28 26.9 28.1 23.3
18 Native American Female 23.9 26.7 25.9 25.6 23.4 24.5
Now is a good time to double check that our table looks like what we expect.
Looks good!
We also want to add a couple of variables, Race_Ethnicity and Gender, so that we can select across groups later. We can use the recode() function of the dplyr package to change specific values as we create a new Race_Ethnicity variable from the Group variable. For the data for all of the US, we want the Race_Ethnicity variable values to be "All_races".
major_groups %<>%
mutate(Race_Ethnicity =
recode(Group, "United States" = "All_races",
"Female" = "All_races",
"Male" = "All_races"))
head(major_groups)
# A tibble: 6 x 8
Group `2008` `2010` `2012` `2014` `2016` `2017` Race_Ethnicity
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 United States 12.6 14.7 14.1 13.2 11.7 11.5 All_races
2 Male 12.3 15.2 14.5 13.3 12.1 11.8 All_races
3 Female 12.9 14.1 13.7 13 11.2 11.1 All_races
4 Asian 7.1 8.5 7.8 7.9 6.6 6.6 Asian
5 Asian Male 6.3 8.3 7.4 7.2 6.7 6.5 Asian Male
6 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7 Asian Female
We also want to remove "Male" and "Female" from this Race_Ethnicity variable. We can do so using the str_remove() function of the stringr package. Importantly, we are also removing the space before "Female" and "Male".
major_groups %<>%
mutate(Race_Ethnicity = str_remove(string = Race_Ethnicity,
pattern = " Female| Male"))
head(major_groups)
# A tibble: 6 x 8
Group `2008` `2010` `2012` `2014` `2016` `2017` Race_Ethnicity
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 United States 12.6 14.7 14.1 13.2 11.7 11.5 All_races
2 Male 12.3 15.2 14.5 13.3 12.1 11.8 All_races
3 Female 12.9 14.1 13.7 13 11.2 11.1 All_races
4 Asian 7.1 8.5 7.8 7.9 6.6 6.6 Asian
5 Asian Male 6.3 8.3 7.4 7.2 6.7 6.5 Asian
6 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7 Asian
For the new Gender variable we would like to extract just the "Female" and "Male" text from the Group variable. The str_extract() function of the stringr package will do this, and it will give us an NA value for any rows where "Female" or "Male" were not present. We can then replace the NA values with the text "All" to represent the total value for both male and female using the replace_na() function of the tidyr package.
major_groups %<>%
mutate(Gender = str_extract(string = Group,
pattern = "Female|Male")) %>%
mutate(Gender = replace_na(Gender, replace = "All"))
head(major_groups)
# A tibble: 6 x 9
Group `2008` `2010` `2012` `2014` `2016` `2017` Race_Ethnicity Gender
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 United States 12.6 14.7 14.1 13.2 11.7 11.5 All_races All
2 Male 12.3 15.2 14.5 13.3 12.1 11.8 All_races Male
3 Female 12.9 14.1 13.7 13 11.2 11.1 All_races Female
4 Asian 7.1 8.5 7.8 7.9 6.6 6.6 Asian All
5 Asian Male 6.3 8.3 7.4 7.2 6.7 6.5 Asian Male
6 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7 Asian Female
We would also like to replace Latino and Latina with Latinx. We can use another stringr function for this. This function, str_replace(), allows us to find and replace a particular pattern.
Question Opportunity
What other tidyverse functions could you use to replace Latino and Latina with Latinx? Why might the str_replace() function be preferable?
Click here to reveal the answer.
While the recode() function will work in this case, the str_replace() function can be used more simply, as it will find the Latino and Latina patterns to replace even when the pattern is part of a longer value. In contrast, the recode() and case_when() functions would require that Latinx be specified as the replacement for Latino and for Latina separately, as well as the more complicated replacements for the compound groups, like Latinx Male for Latino Male. This is longer and thus creates more opportunities for a mistake.
You could also use the case_when() function of the dplyr package if you are familiar with it. However, this would have the same issues as the recode() function, in that it replaces values, not patterns. In addition, you would need to make sure that the other values of the Race_Ethnicity and Group variables were not replaced with NA.
major_groups %>%
mutate(Race_Ethnicity =
recode(Race_Ethnicity, "Latino" = "Latinx",
"Latina" = "Latinx"),
Group =
recode(Group, "Latino" = "Latinx",
"Latino Male" = "Latinx Male",
"Latina Female" = "Latinx Female"))
# A tibble: 18 x 9
Group `2008` `2010` `2012` `2014` `2016` `2017` Race_Ethnicity Gender
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 United States 12.6 14.7 14.1 13.2 11.7 11.5 All_races All
2 Male 12.3 15.2 14.5 13.3 12.1 11.8 All_races Male
3 Female 12.9 14.1 13.7 13 11.2 11.1 All_races Female
4 Asian 7.1 8.5 7.8 7.9 6.6 6.6 Asian All
5 Asian Male 6.3 8.3 7.4 7.2 6.7 6.5 Asian Male
6 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7 Asian Female
7 White 9.7 11.7 11.2 10.8 9.7 9.4 White All
8 White Male 9.5 12.3 11.5 10.8 10 9.6 White Male
9 White Female 10 11.1 10.8 10.7 9.4 9.1 White Female
10 Latinx 16.7 18.5 17.3 15.2 13.7 13.2 Latinx All
11 Latinx Male 13.6 16.8 16 14 12.6 12.4 Latinx Male
12 Latinx Female 20.2 20.3 18.8 16.5 14.8 13.9 Latinx Female
13 Black 20.4 22.5 22.4 20.6 17.2 17.9 Black All
14 Black Male 23.7 26 25.6 23.5 20.1 20.8 Black Male
15 Black Female 17 19 19.3 17.6 14.2 14.8 Black Female
16 Native Ameri… 24.4 28.8 27 26.3 25.8 23.9 Native Americ… All
17 Native Ameri… 25 30.9 28 26.9 28.1 23.3 Native Americ… Male
18 Native Ameri… 23.9 26.7 25.9 25.6 23.4 24.5 Native Americ… Female
major_groups %<>%
mutate(across(.cols = c(Group, Race_Ethnicity),
~str_replace(string = .,
pattern = "Latino|Latina",
replacement = "Latinx")))
# A tibble: 18 x 9
Group `2008` `2010` `2012` `2014` `2016` `2017` Race_Ethnicity Gender
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 United States 12.6 14.7 14.1 13.2 11.7 11.5 All_races All
2 Male 12.3 15.2 14.5 13.3 12.1 11.8 All_races Male
3 Female 12.9 14.1 13.7 13 11.2 11.1 All_races Female
4 Asian 7.1 8.5 7.8 7.9 6.6 6.6 Asian All
5 Asian Male 6.3 8.3 7.4 7.2 6.7 6.5 Asian Male
6 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7 Asian Female
7 White 9.7 11.7 11.2 10.8 9.7 9.4 White All
8 White Male 9.5 12.3 11.5 10.8 10 9.6 White Male
9 White Female 10 11.1 10.8 10.7 9.4 9.1 White Female
10 Latinx 16.7 18.5 17.3 15.2 13.7 13.2 Latinx All
11 Latinx Male 13.6 16.8 16 14 12.6 12.4 Latinx Male
12 Latinx Female 20.2 20.3 18.8 16.5 14.8 13.9 Latinx Female
13 Black 20.4 22.5 22.4 20.6 17.2 17.9 Black All
14 Black Male 23.7 26 25.6 23.5 20.1 20.8 Black Male
15 Black Female 17 19 19.3 17.6 14.2 14.8 Black Female
16 Native Ameri… 24.4 28.8 27 26.3 25.8 23.9 Native Americ… All
17 Native Ameri… 25 30.9 28 26.9 28.1 23.3 Native Americ… Male
18 Native Ameri… 23.9 26.7 25.9 25.6 23.4 24.5 Native Americ… Female
Finally, we would like to change the shape of our table so that we have a new column that represents the year and a new column that represents the value for that year.
To do so we will be making our table “longer”, meaning that it will have fewer columns and more rows. See here for more information about different table formats, typically referred to as wide and long or sometimes narrow.
We will use the pivot_longer() function of the tidyr package to change the shape of our table.
There are 3 main arguments in this function:
- cols - specifies what columns to collapse
- names_to - specifies the name of the new column that will contain the column names of the columns you are collapsing
- values_to - specifies the name of the new column that will contain the values from the columns you are collapsing

To specify that we want to collapse all the columns that have year values, we can choose those that contain the string "20" using the contains() function of dplyr.
major_groups_long <- major_groups %>%
tidyr::pivot_longer(cols = contains("20"),
names_to = "Year",
values_to = "Percent",
names_prefix = "Perc_") %>%
dplyr::mutate(Year = as.numeric(Year))
major_groups_long
# A tibble: 108 x 5
Group Race_Ethnicity Gender Year Percent
<chr> <chr> <chr> <dbl> <dbl>
1 United States All_races All 2008 12.6
2 United States All_races All 2010 14.7
3 United States All_races All 2012 14.1
4 United States All_races All 2014 13.2
5 United States All_races All 2016 11.7
6 United States All_races All 2017 11.5
7 Male All_races Male 2008 12.3
8 Male All_races Male 2010 15.2
9 Male All_races Male 2012 14.5
10 Male All_races Male 2014 13.3
# … with 98 more rows
Excellent, now we need to do the same for the other two tables on this page:
Now let’s do the same for the Asian subgroups table.
First we will start by importing a screenshot for this table without the header, as we did before. The name of the file for the screenshot is asian_subgroups.png and it is located in the img directory.
Question Opportunity
Can you recall the command to import an image into R using the magick package?
Click here to reveal the code.
Question Opportunity
Can you recall the command to extract the text from an image using the magick package?
Click here to reveal the code.
[1] "United States 11.5\nMale 11.8\nFemale 11.1\nASIAN 6.8\nAsian Male 65\nAsian Female 67\nCHINESE 43\nChinese Male AT\nChinese Female 3.9\nVIETNAMESE 5.5\nVietnamese Male 75\nVietnamese Female 3.4\nINDIAN 5.9\nIndian Male 4.1\nIndian Female 78\nPAKISTANI 64\nPakistani Male\n\nPakistani Female\n\nKOREAN 65\nKorean Male 8.0\nKorean Female 5.0\nTWO OR MORE 68\nTwo or More Male\n\nTwo or More Female\nFILIPINO. S™S™~SS\nFilipino Male 65\nFilipino Female 81\nHMONG 14.0\nHmong Male 18.6\n"
Now we again want to split the data into rows based on the newline regex. This is something we will continue to do for all the tables.
One option is to copy and paste the code we wrote above each time. However, this is not very efficient and is error prone. Alternatively, we can create an R function to accomplish this succinctly. Functions allow us to perform the same process on multiple data inputs. See this other case study for more details about how to write a function.
In general, the process of writing functions involves first specifying an input that is used within the function to create an output. In this case, the data input is text which will be replaced by the actual image text that we are working on, and then used in the subsequent steps to wrangle the data. We will call our function make_rows().
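A minimal sketch of such a function, assuming the same splitting steps used above (split the string at each newline and place the resulting rows in a tibble with a column called value; the actual make_rows() in the case study may differ slightly):

```r
library(stringr)
library(tibble)

# Split OCR text at each newline and return one row per line
make_rows <- function(text) {
  rows <- unlist(str_split(text, pattern = "\n"))
  tibble(value = rows)
}
```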
Great! Now let’s apply our function to the asian_subgroups data!
# A tibble: 33 x 1
value
<chr>
1 United States 11.5
2 Male 11.8
3 Female 11.1
4 ASIAN 6.8
5 Asian Male 65
6 Asian Female 67
7 CHINESE 43
8 Chinese Male AT
9 Chinese Female 3.9
10 VIETNAMESE 5.5
# … with 23 more rows
As you can see, there are some strange values for some of the rows. For example, the row that starts with Chinese Male has "AT" as the percentage of disconnected youth, and the row that should say FILIPINO says "S™S™~SS".
[[1]]
[1] "Chinese" "Male" "AT"
[[1]]
[1] "FILIPINO." "S™S™~SS"
Notice how there are empty rows. According to the report (PDF), these spaces are empty to denote that the estimates were unreliable for these groups. Since we don't really need these rows (they are missing values anyway), and because images with less text and higher resolution are often easier for the OCR to interpret correctly, we will now try using three images of this table rather than just one. This allows us to zoom in on the parts of the table that we actually want and, hopefully, the OCR will be able to interpret the text correctly.
So we will now import and extract text from three screenshots of this table: the first image stops just after the row that starts with PAKISTANI, the second covers the Korean rows up to the next row with no values, and the third starts at the row that starts with FILIPINO. This way none of the empty rows will be included.
Asian_sub_A <- image_read(here("img", "asian_sub_A.png"))
Asian_sub_B <- image_read(here("img", "asian_sub_B.png"))
Asian_sub_C <- image_read(here("img", "asian_sub_C.png"))
Asian_sub_A
Asian_sub_A <- image_ocr(Asian_sub_A)
Asian_sub_B <- image_ocr(Asian_sub_B)
Asian_sub_C <- image_ocr(Asian_sub_C)
Asian_sub_A <- make_rows(Asian_sub_A)
Asian_sub_B <- make_rows(Asian_sub_B)
Asian_sub_C <- make_rows(Asian_sub_C)
Asian_sub_A
# A tibble: 17 x 1
value
<chr>
1 "United States 11.5"
2 "Male 11.8"
3 "Female 11.1"
4 "ASIAN 6.6"
5 "Asian Male 6.5"
6 "Asian Female 6.7"
7 "CHINESE 4.3"
8 "Chinese Male 4.7"
9 "Chinese Female 3.9"
10 "VIETNAMESE 5.5"
11 "Vietnamese Male 7.5"
12 "Vietnamese Female 3.4"
13 "INDIAN 5.9"
14 "Indian Male 4.1"
15 "Indian Female 7.8"
16 "PAKISTANI 6.4"
17 ""
# A tibble: 5 x 1
value
<chr>
1 "KOREAN 6.5"
2 "Korean Male 8.0"
3 "Korean Female 5.0"
4 "TWO OR MORE 6.6"
5 ""
# A tibble: 6 x 1
value
<chr>
1 "FILIPINO 7.3"
2 "Filipino Male 6.5"
3 "Filipino Female 8.1"
4 "HMONG 14.0"
5 "Hmong Male 18.6"
6 ""
Much better!
We can now combine the objects with the bind_rows() function of the dplyr package, which will append each of these tibbles together one after the other.
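For example (assuming we store the combined table in an object called asian_subgroups):

```r
asian_subgroups <- bind_rows(Asian_sub_A, Asian_sub_B, Asian_sub_C)
```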
# A tibble: 28 x 1
value
<chr>
1 United States 11.5
2 Male 11.8
3 Female 11.1
4 ASIAN 6.6
5 Asian Male 6.5
6 Asian Female 6.7
7 CHINESE 4.3
8 Chinese Male 4.7
9 Chinese Female 3.9
10 VIETNAMESE 5.5
# … with 18 more rows
Looks pretty good!
Now we have similar wrangling steps to perform as we did previously and we will need to do the same for the Latinx subgroups table. So it is a good idea to make another function.
Even though we appear to have all of the decimal places for the values, we will include this in our function, just to make sure the data is correct.
Question Opportunity
Can you explain what each of the commands are doing within the function?
clean_table <- function(table){
table %>%
separate(., col = value,
into = c("Group", "Percentage"),
sep = "(?<=[:alpha:])\\s(?=[0-9])") %>%
drop_na() %>%
mutate(Group = str_to_title(Group)) %>%
mutate(Percentage = str_remove(string = Percentage,
pattern = "\\.")) %>%
separate(Percentage, c("Percent"), sep = " ") %>%
mutate(Percent = as.numeric(Percent)) %>%
mutate(Percent = Percent * 0.1) %>%
mutate(Race_Ethnicity = recode(Group,
"United States" = "All_races",
"Female" = "All_races",
"Male" = "All_races")) %>%
mutate(Race_Ethnicity = str_remove(string = Race_Ethnicity,
pattern = " Female| Male")) %>%
mutate(Gender = str_extract(string = Group,
pattern ="Female|Male")) %>%
mutate(Gender = replace_na(Gender, replace = "All"))
}
# A tibble: 25 x 4
Group Percent Race_Ethnicity Gender
<chr> <dbl> <chr> <chr>
1 United States 11.5 All_races All
2 Male 11.8 All_races Male
3 Female 11.1 All_races Female
4 Asian 6.6 Asian All
5 Asian Male 6.5 Asian Male
6 Asian Female 6.7 Asian Female
7 Chinese 4.3 Chinese All
8 Chinese Male 4.7 Chinese Male
9 Chinese Female 3.9 Chinese Female
10 Vietnamese 5.5 Vietnamese All
# … with 15 more rows
Great! This looks as we expected.
Question Opportunity
Why do we not need to use pivot_longer() with this data?
Recall that this is the table we want to wrangle:
Question Opportunity
Do you notice anything incorrect about this table? It's always important to pay attention to the data, even when it is in raw format.
Click here to see what is incorrect.
The last two rows both say “Female”.
Sometimes when wrangling text data, we will come across a typo. We need to determine how to respond to the typo and make note of it. It’s often best to consult a secondary source to confirm that changes made are accurate.
For the purposes of this case study, we will assume that the first of the two rows represents male disconnection rates in the Latino/a subgroup; this would be consistent with the ordering of genders in the previous subgroups.
We will make sure to correct this typo when we can.
After trial and error, three screenshots were determined best for importing this data. The names of the files for the screenshots are latinx_sub_A.png, latinx_sub_B.png, and latinx_sub_C.png. They are located in the img directory.
Question Opportunity
Can you recall the commands to import and extract the data?
Click here to reveal the code.
latinx_imageA <- image_read(here::here("img", "latinx_sub_A.png"))
latinx_imageB <- image_read(here::here("img", "latinx_sub_B.png"))
latinx_imageC <- image_read(here::here("img", "latinx_sub_C.png"))
latinx_A <- image_ocr(latinx_imageA)
latinx_B <- image_ocr(latinx_imageB)
latinx_C <- image_ocr(latinx_imageC)
[1] "LATINO 13.2\nLatino Male 12.4\nLatina Female 13.9\n"
[1] "SOUTH AMERICAN 8.4\nSouth American Male 9.1\nSouth American Female 7.7\nCENTRAL AMERICAN 12.0\nCentral American Male 9.3\nCentral American Female 15.0\nMEXICAN 13.3\nMexican Male 12.2\nMexican Female 14.4\nOTHER LATINO 13.6\nOther Latino Male 15.3\nOther Latina Female 11.5\n"
[1] "PUERTO RICAN, DOMINICAN, CUBAN 15.1\nPR, DR, Cuban Female 15.7\nPR, DR, Cuban Female 14.4\n"
We can combine the strings together using the str_c() function of the stringr package (c as in the base c() function; it combines multiple strings into one).
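For example (assuming the combined string is stored in an object called latinx):

```r
latinx <- str_c(latinx_A, latinx_B, latinx_C)
```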
[1] "LATINO 13.2\nLatino Male 12.4\nLatina Female 13.9\nSOUTH AMERICAN 8.4\nSouth American Male 9.1\nSouth American Female 7.7\nCENTRAL AMERICAN 12.0\nCentral American Male 9.3\nCentral American Female 15.0\nMEXICAN 13.3\nMexican Male 12.2\nMexican Female 14.4\nOTHER LATINO 13.6\nOther Latino Male 15.3\nOther Latina Female 11.5\nPUERTO RICAN, DOMINICAN, CUBAN 15.1\nPR, DR, Cuban Female 15.7\nPR, DR, Cuban Female 14.4\n"
Now let’s correct that typo.
Question Opportunity
How might you do this?
Click here to reveal the code.
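One way to do this relies on the fact that str_replace() replaces only the first match it finds, so only the first of the two "Female" rows is changed (the object name latinx is an assumption):

```r
latinx <- str_replace(string = latinx,
                      pattern = "PR, DR, Cuban Female",
                      replacement = "PR, DR, Cuban Male")
```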
[1] "LATINO 13.2\nLatino Male 12.4\nLatina Female 13.9\nSOUTH AMERICAN 8.4\nSouth American Male 9.1\nSouth American Female 7.7\nCENTRAL AMERICAN 12.0\nCentral American Male 9.3\nCentral American Female 15.0\nMEXICAN 13.3\nMexican Male 12.2\nMexican Female 14.4\nOTHER LATINO 13.6\nOther Latino Male 15.3\nOther Latina Female 11.5\nPUERTO RICAN, DOMINICAN, CUBAN 15.1\nPR, DR, Cuban Male 15.7\nPR, DR, Cuban Female 14.4\n"
Question Opportunity
Can you recall the commands within our make_rows() function to separate the data into rows and create a tibble?
Click here to reveal the code.
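We can simply apply the make_rows() function we wrote earlier to the combined string (again assuming it is called latinx):

```r
latinx <- make_rows(latinx)
```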
# A tibble: 19 x 1
value
<chr>
1 "LATINO 13.2"
2 "Latino Male 12.4"
3 "Latina Female 13.9"
4 "SOUTH AMERICAN 8.4"
5 "South American Male 9.1"
6 "South American Female 7.7"
7 "CENTRAL AMERICAN 12.0"
8 "Central American Male 9.3"
9 "Central American Female 15.0"
10 "MEXICAN 13.3"
11 "Mexican Male 12.2"
12 "Mexican Female 14.4"
13 "OTHER LATINO 13.6"
14 "Other Latino Male 15.3"
15 "Other Latina Female 11.5"
16 "PUERTO RICAN, DOMINICAN, CUBAN 15.1"
17 "PR, DR, Cuban Male 15.7"
18 "PR, DR, Cuban Female 14.4"
19 ""
Now we can apply our function.
# A tibble: 18 x 4
Group Percent Race_Ethnicity Gender
<chr> <dbl> <chr> <chr>
1 Latino 13.2 Latino All
2 Latino Male 12.4 Latino Male
3 Latina Female 13.9 Latina Female
4 South American 8.4 South American All
5 South American Male 9.1 South American Male
6 South American Female 7.7 South American Female
7 Central American 12 Central American All
8 Central American Male 9.3 Central American Male
9 Central American Female 15 Central American Female
10 Mexican 13.3 Mexican All
11 Mexican Male 12.2 Mexican Male
12 Mexican Female 14.4 Mexican Female
13 Other Latino 13.6 Other Latino All
14 Other Latino Male 15.3 Other Latino Male
15 Other Latina Female 11.5 Other Latina Female
16 Puerto Rican, Dominican, Cuban 15.1 Puerto Rican, Dominican, Cuban All
17 Pr, Dr, Cuban Male 15.7 Pr, Dr, Cuban Male
18 Pr, Dr, Cuban Female 14.4 Pr, Dr, Cuban Female
It looks like we’ve successfully corrected the typo!
Let’s also replace the abbreviations for Puerto Rican and Dominican and let’s replace Latino/Latina with Latinx.
Question Opportunity
How might you do this using the str_replace() function?
Click here to reveal the code.
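A sketch using str_replace_all() on both the Group and Race_Ethnicity columns (the table name latinx_subgroups is an assumption):

```r
latinx_subgroups %<>%
  # replace Latino/Latina with Latinx wherever the pattern appears
  mutate(across(.cols = c(Group, Race_Ethnicity),
                ~ str_replace_all(string = .,
                                  pattern = "Latino|Latina",
                                  replacement = "Latinx"))) %>%
  # expand the title-cased abbreviations produced by str_to_title()
  mutate(across(.cols = c(Group, Race_Ethnicity),
                ~ str_replace_all(string = .,
                                  pattern = "Pr, Dr,",
                                  replacement = "Puerto Rican, Dominican,")))
```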
# A tibble: 18 x 4
Group Percent Race_Ethnicity Gender
<chr> <dbl> <chr> <chr>
1 Latinx 13.2 Latinx All
2 Latinx Male 12.4 Latinx Male
3 Latinx Female 13.9 Latinx Female
4 South American 8.4 South American All
5 South American Male 9.1 South American Male
6 South American Female 7.7 South American Female
7 Central American 12 Central American All
8 Central American Male 9.3 Central American Male
9 Central American Female 15 Central American Female
10 Mexican 13.3 Mexican All
11 Mexican Male 12.2 Mexican Male
12 Mexican Female 14.4 Mexican Female
13 Other Latinx 13.6 Other Latinx All
14 Other Latinx Male 15.3 Other Latinx Male
15 Other Latinx Female 11.5 Other Latinx Female
16 Puerto Rican, Dominican, Cuban 15.1 Puerto Rican, Dominican, Cu… All
17 Puerto Rican, Dominican, Cuban M… 15.7 Puerto Rican, Dominican, Cu… Male
18 Puerto Rican, Dominican, Cuban F… 14.4 Puerto Rican, Dominican, Cu… Female
Great!
Now we are ready to look at the data from 2018 for the Asian and Latinx subgroups from the other report.
Recall that this was the page with the table of interest for the Asian subgroups with 2018 data:
As you can see, the data for the subgroups is shown in the table but the overall data for Asians is located in the text.
We will use a screenshot of each to extract the data for this year.
Trial and error indicated that again dividing the table into multiple screenshots improved the text extraction:
asian_sub_2018_A <- image_read(here::here("img", "asian_sub_2018_A.png"))
asian_sub_2018_A <- image_ocr(asian_sub_2018_A)
asian_sub_2018_B <- image_read(here::here("img", "asian_sub_2018_B.png"))
asian_sub_2018_B <- image_ocr(asian_sub_2018_B)
asian_sub_2018 <- str_c(asian_sub_2018_A, asian_sub_2018_B)
# A tibble: 23 x 1
value
<chr>
1 "CHINESE : 41"
2 "Men 4.5"
3 "Women : 3.7"
4 ""
5 "INDIAN 5.4"
6 "Men 4.7"
7 "Women : 6.1"
8 ""
9 "KOREAN : 5.5"
10 "Men 5.6"
# … with 13 more rows
Now we need to modify our function a bit for this new data.
First, we now have colons (:) in our table that we will want to separate by. Unfortunately, the text in each row isn't extracted in the same way by the OCR. Some rows have only a space, while others have spaces around a colon; for the row with VIETNAMESE we see a colon directly after the word, followed by a space. Thus, we will modify our separate() function with this change. We can specify that the separator between any letter and any digit should be either a space (\\s) or a colon with a space before and after it (\\s:\\s) using the or (|) operator.
So this will look like this:
asian_sub_2018 %>%
separate(., col = value,
into = c("Group", "Percent"),
sep = "(?<=[:alpha:])\\s:\\s|\\s(?=[0-9])")
# A tibble: 23 x 2
Group Percent
<chr> <chr>
1 "CHINESE" 41
2 "Men" 4.5
3 "Women" 3.7
4 "" <NA>
5 "INDIAN" 5.4
6 "Men" 4.7
7 "Women" 6.1
8 "" <NA>
9 "KOREAN" 5.5
10 "Men" 5.6
# … with 13 more rows
Then, because of the row with VIETNAMESE, we will want to remove this remaining colon using the str_remove() function, like this:
asian_sub_2018 %>%
separate(., col = value,
into = c("Group", "Percent"),
sep = "(?<=[:alpha:])\\s:\\s|\\s(?=[0-9])") %>%
mutate(Group = str_remove(string = Group, pattern = ":"))
The other difference from the previous function is that we want to fill in a new Race_Ethnicity variable using values from the previous rows. We can do so by first replacing "Men" or "Women" (with the or operator: "Men|Women") with "missing". Then we convert these to NA values using the na_if() function of the dplyr package; we just need to specify what column to modify and what value to change to NA. Finally, we will replace the NA values with the previous non-NA value using the fill() function of the tidyr package. Note that fill() does not work inside of the mutate() function; we simply specify what column to modify and the direction in which to fill. In this case we want to fill in the downward direction, using the previous values.
This will look like this:
asian_sub_2018 %>%
separate(., col = value,
into = c("Group", "Percent"),
sep = "(?<=[:alpha:])\\s:\\s|\\s(?=[0-9])") %>%
mutate(Group = str_remove(string = Group, pattern = ":")) %>%
mutate(Race_Ethnicity = str_replace(string = Group,
pattern = "Men|Women",
replacement = "missing")) %>%
head()
# A tibble: 6 x 3
Group Percent Race_Ethnicity
<chr> <chr> <chr>
1 "CHINESE" 41 "CHINESE"
2 "Men" 4.5 "missing"
3 "Women" 3.7 "missing"
4 "" <NA> ""
5 "INDIAN" 5.4 "INDIAN"
6 "Men" 4.7 "missing"
asian_sub_2018 %>%
separate(., col = value,
into = c("Group", "Percent"),
sep = "(?<=[:alpha:])\\s:\\s|\\s(?=[0-9])") %>%
mutate(Group = str_remove(string = Group, pattern = ":")) %>%
mutate(Race_Ethnicity = str_replace(string = Group,
pattern = "Men|Women",
replacement = "missing")) %>%
mutate(Race_Ethnicity = na_if(Race_Ethnicity, "missing")) %>%
head()
# A tibble: 6 x 3
Group Percent Race_Ethnicity
<chr> <chr> <chr>
1 "CHINESE" 41 "CHINESE"
2 "Men" 4.5 <NA>
3 "Women" 3.7 <NA>
4 "" <NA> ""
5 "INDIAN" 5.4 "INDIAN"
6 "Men" 4.7 <NA>
asian_sub_2018 %>%
separate(., col = value,
into = c("Group", "Percent"),
sep = "(?<=[:alpha:])\\s:\\s|\\s(?=[0-9])") %>%
mutate(Group = str_remove(string = Group, pattern = ":")) %>%
mutate(Race_Ethnicity = str_replace(string = Group,
pattern = "Men|Women",
replacement = "missing")) %>%
mutate(Race_Ethnicity = na_if(Race_Ethnicity, "missing")) %>%
fill(Race_Ethnicity, .direction = "down") %>%
head()
# A tibble: 6 x 3
Group Percent Race_Ethnicity
<chr> <chr> <chr>
1 "CHINESE" 41 "CHINESE"
2 "Men" 4.5 "CHINESE"
3 "Women" 3.7 "CHINESE"
4 "" <NA> ""
5 "INDIAN" 5.4 "INDIAN"
6 "Men" 4.7 "INDIAN"
OK! Now, let’s combine these pieces of our new function with the old pieces:
clean_table <- function(table){
table %>%
separate(., col = value,
into = c("Group", "Percent"),
sep = "(?<=[:alpha:])\\s:\\s|\\s(?=[0-9])") %>%
mutate(Group = str_remove(string = Group,
pattern = ":")) %>%
drop_na() %>%
mutate(Group = str_to_title(string = Group)) %>%
mutate(Percent = str_remove(string = Percent,
pattern = "\\.")) %>%
mutate(Percent = as.numeric(Percent)) %>%
mutate(Percent = Percent * 0.1) %>%
mutate(Race_Ethnicity = str_replace(string = Group,
pattern = "Men|Women",
replacement = "missing")) %>%
mutate(Race_Ethnicity = na_if(Race_Ethnicity, "missing")) %>%
fill(Race_Ethnicity, .direction = "down") %>%
mutate(Gender = str_extract(string = Group,
pattern = "Men|Women")) %>%
mutate(Gender = replace_na(Gender, replace = "All"))
}
# A tibble: 17 x 4
Group Percent Race_Ethnicity Gender
<chr> <dbl> <chr> <chr>
1 Chinese 4.1 Chinese All
2 Men 4.5 Chinese Men
3 Women 3.7 Chinese Women
4 Indian 5.4 Indian All
5 Men 4.7 Indian Men
6 Women 6.1 Indian Women
7 Korean 5.5 Korean All
8 Men 5.6 Korean Men
9 Women 5.4 Korean Women
10 Vietnamese 6.3 Vietnamese All
11 Men 7.6 Vietnamese Men
12 Women 5 Vietnamese Women
13 Filipino 6.8 Filipino All
14 Men 6.3 Filipino Men
15 Women 7.4 Filipino Women
16 Hmong 10.2 Hmong All
17 Cambodian 13.8 Cambodian All
Looking good!
Now we just need to add the data for all Asians from the text.
We can do this using the add_row() function of the dplyr package.
asian_sub_2018 %<>%
add_row(Group = "Asian",
Percent = 6.2,
Race_Ethnicity = "Asian",
Gender = "All") %>%
add_row(Group = "Asian",
Percent = 6.4,
Race_Ethnicity = "Asian",
Gender = "Men") %>%
add_row(Group = "Asian",
Percent = 6.1,
Race_Ethnicity = "Asian",
Gender = "Women")
asian_sub_2018
# A tibble: 20 x 4
Group Percent Race_Ethnicity Gender
<chr> <dbl> <chr> <chr>
1 Chinese 4.1 Chinese All
2 Men 4.5 Chinese Men
3 Women 3.7 Chinese Women
4 Indian 5.4 Indian All
5 Men 4.7 Indian Men
6 Women 6.1 Indian Women
7 Korean 5.5 Korean All
8 Men 5.6 Korean Men
9 Women 5.4 Korean Women
10 Vietnamese 6.3 Vietnamese All
11 Men 7.6 Vietnamese Men
12 Women 5 Vietnamese Women
13 Filipino 6.8 Filipino All
14 Men 6.3 Filipino Men
15 Women 7.4 Filipino Women
16 Hmong 10.2 Hmong All
17 Cambodian 13.8 Cambodian All
18 Asian 6.2 Asian All
19 Asian 6.4 Asian Men
20 Asian 6.1 Asian Women
OK, now we just want to combine the 2018 data and the 2017 data for the Asian subgroups.
First let’s add a variable for year to both. Using mutate() we can add a variable Year where all values are 2017 like so:
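For example (the name of the 2017 table, asian_subgroups_2017, is an assumption):

```r
asian_subgroups_2017 %<>%
  mutate(Year = 2017)

asian_sub_2018 %<>%
  mutate(Year = 2018)
```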
# A tibble: 25 x 5
Group Percent Race_Ethnicity Gender Year
<chr> <dbl> <chr> <chr> <dbl>
1 United States 11.5 All_races All 2017
2 Male 11.8 All_races Male 2017
3 Female 11.1 All_races Female 2017
4 Asian 6.6 Asian All 2017
5 Asian Male 6.5 Asian Male 2017
6 Asian Female 6.7 Asian Female 2017
7 Chinese 4.3 Chinese All 2017
8 Chinese Male 4.7 Chinese Male 2017
9 Chinese Female 3.9 Chinese Female 2017
10 Vietnamese 5.5 Vietnamese All 2017
# … with 15 more rows
# A tibble: 20 x 5
Group Percent Race_Ethnicity Gender Year
<chr> <dbl> <chr> <chr> <dbl>
1 Chinese 4.1 Chinese All 2018
2 Men 4.5 Chinese Men 2018
3 Women 3.7 Chinese Women 2018
4 Indian 5.4 Indian All 2018
5 Men 4.7 Indian Men 2018
6 Women 6.1 Indian Women 2018
7 Korean 5.5 Korean All 2018
8 Men 5.6 Korean Men 2018
9 Women 5.4 Korean Women 2018
10 Vietnamese 6.3 Vietnamese All 2018
11 Men 7.6 Vietnamese Men 2018
12 Women 5 Vietnamese Women 2018
13 Filipino 6.8 Filipino All 2018
14 Men 6.3 Filipino Men 2018
15 Women 7.4 Filipino Women 2018
16 Hmong 10.2 Hmong All 2018
17 Cambodian 13.8 Cambodian All 2018
18 Asian 6.2 Asian All 2018
19 Asian 6.4 Asian Men 2018
20 Asian 6.1 Asian Women 2018
You may notice that Gender is coded differently for the two years. Let’s make this consistent now:
asian_sub_2018 %<>%
mutate(across(.cols = c(Gender, Group),
~ str_replace(string = .,
pattern = "Men",
replacement = "Male")),
across(.cols = c(Gender, Group),
~ str_replace(string = .,
pattern = "Women",
replacement = "Female")))
We can combine these two tibbles using the bind_rows() function of dplyr.
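For example (again assuming the 2017 table is called asian_subgroups_2017 and the combined table asian_subgroups):

```r
asian_subgroups <- bind_rows(asian_subgroups_2017, asian_sub_2018)
```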
Notice that there are some cases where we only have one value for a particular group. For example, there are no male or female values for the Pakistani data.
We would like to have NA values for any year/gender combinations that are missing. We can fill out the rest of the table with NA values by applying the pivot_wider() and pivot_longer() functions sequentially like so:
asian_subgroups %<>%
select(-Group) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
               names_to = "Year",
               values_to = "Percent")
Click here to see how this interactive table was created
This table was created using the datatable() function of the DT package.
Like so:
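The chunk behind the reveal was not carried over; a minimal sketch using the DT package would be:

```r
library(DT)

# Render the combined tibble as an interactive, sortable HTML table
datatable(asian_subgroups)
```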
Great, now we are ready to perform similar wrangling for the Latinx subgroups.
Recall that this was the page with the table of interest for the Latinx subgroup 2018 data:
In this case only a single image was needed:
latinx_sub_2018 <- image_read(here::here("img", "latinx_subgroups_2018.png"))
latinx_sub_2018 <- image_ocr(latinx_sub_2018)
latinx_sub_2018
[1] "SOUTH :\n\nAMERICAN : 8.0\nMen 7.5\nMEXICAN 12.9\nMen 12.0\nWomen 13.8\nPR, DR, CUBAN 13.7\nMen 14.9\nWomen 12.4\nCENTRAL\n\nAMERICAN : 13.7\nMen 11.8\nWomen : 15.9\n"
Let’s first combine the South and Central American labels. Notice that there are multiple new line expressions in between and we don’t see repeated \n characters elsewhere. We can replace either exactly two \n characters (using \n{2} to specify exactly two) or the same two newlines preceded by a space and a colon (\s:\n{2}) with a single space.
latinx_sub_2018 <- str_replace_all(string = latinx_sub_2018,
pattern = "\\s:\n{2}|\n{2}", replacement = " ")
latinx_sub_2018
[1] "SOUTH AMERICAN : 8.0\nMen 7.5\nMEXICAN 12.9\nMen 12.0\nWomen 13.8\nPR, DR, CUBAN 13.7\nMen 14.9\nWomen 12.4\nCENTRAL AMERICAN : 13.7\nMen 11.8\nWomen : 15.9\n"
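The step that produced the tibble below was not carried over; a sketch, splitting the OCR string on newlines and storing each line in a column named value:

```r
library(stringr)
library(tibble)

# One row per OCR line; the trailing newline yields a final empty string
latinx_sub_2018 <- tibble(value = str_split(latinx_sub_2018, pattern = "\n")[[1]])
```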
# A tibble: 12 x 1
value
<chr>
1 "SOUTH AMERICAN : 8.0"
2 "Men 7.5"
3 "MEXICAN 12.9"
4 "Men 12.0"
5 "Women 13.8"
6 "PR, DR, CUBAN 13.7"
7 "Men 14.9"
8 "Women 12.4"
9 "CENTRAL AMERICAN : 13.7"
10 "Men 11.8"
11 "Women : 15.9"
12 ""
# A tibble: 11 x 4
Group Percent Race_Ethnicity Gender
<chr> <dbl> <chr> <chr>
1 South American 8 South American All
2 Men 7.5 South American Men
3 Mexican 12.9 Mexican All
4 Men 12 Mexican Men
5 Women 13.8 Mexican Women
6 Pr, Dr, Cuban 13.7 Pr, Dr, Cuban All
7 Men 14.9 Pr, Dr, Cuban Men
8 Women 12.4 Pr, Dr, Cuban Women
9 Central American 13.7 Central American All
10 Men 11.8 Central American Men
11 Women 15.9 Central American Women
Again we will replace Pr, Dr, Cuban:
latinx_sub_2018 %<>%
mutate(Group =
str_replace(string = Group,
pattern = "Pr, Dr, Cuban",
replacement = "Puerto Rican, Dominican, Cuban"),
Race_Ethnicity =
str_replace(string = Race_Ethnicity,
pattern = "Pr, Dr, Cuban",
                 replacement = "Puerto Rican, Dominican, Cuban"))
We also want to add the total Latinx values according to the text:
Question Opportunity
Can you recall how to add additional rows?
Click here to reveal the code.
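A sketch of what that hidden chunk might look like, using the add_row() function of the tibble package; the total Latinx values (12.8 overall, 12.3 for men, 13.3 for women) are taken from the output shown below:

```r
library(tibble)
library(magrittr)

# Append one row per gender for the overall Latinx values
latinx_sub_2018 %<>%
  add_row(Group = "Latinx", Percent = 12.8,
          Race_Ethnicity = "Latinx", Gender = "All") %>%
  add_row(Group = "Latinx", Percent = 12.3,
          Race_Ethnicity = "Latinx", Gender = "Men") %>%
  add_row(Group = "Latinx", Percent = 13.3,
          Race_Ethnicity = "Latinx", Gender = "Women")
```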
# A tibble: 6 x 4
Group Percent Race_Ethnicity Gender
<chr> <dbl> <chr> <chr>
1 Central American 13.7 Central American All
2 Men 11.8 Central American Men
3 Women 15.9 Central American Women
4 Latinx 12.8 Latinx All
5 Latinx 12.3 Latinx Men
6 Latinx 13.3 Latinx Women
And now we will recode gender like before to be consistent:
latinx_sub_2018 %<>%
mutate(across(.cols = c(Gender, Group),
~ str_replace(string = .,
pattern = "Men",
replacement = "Male")),
across(.cols = c(Gender, Group),
~ str_replace(string = .,
pattern = "Women",
replacement = "Female")))
head(latinx_sub_2018)
# A tibble: 6 x 4
Group Percent Race_Ethnicity Gender
<chr> <dbl> <chr> <chr>
1 South American 8 South American All
2 Male 7.5 South American Male
3 Mexican 12.9 Mexican All
4 Male 12 Mexican Male
5 Female 13.8 Mexican Female
6 Puerto Rican, Dominican, Cuban 13.7 Puerto Rican, Dominican, Cuban All
Now we just need to combine all the data for the Latinx subgroups.
Again, first we will add a year variable to both the 2017 and 2018 data.
latinx_sub_2017 %<>%
mutate(Year = 2017)
latinx_sub_2018 %<>%
mutate(Year = 2018)
latinx_subgroups <- bind_rows(latinx_sub_2017, latinx_sub_2018)
latinx_subgroups
# A tibble: 32 x 5
Group Percent Race_Ethnicity Gender Year
<chr> <dbl> <chr> <chr> <dbl>
1 Latinx 13.2 Latinx All 2017
2 Latinx Male 12.4 Latinx Male 2017
3 Latinx Female 13.9 Latinx Female 2017
4 South American 8.4 South American All 2017
5 South American Male 9.1 South American Male 2017
6 South American Female 7.7 South American Female 2017
7 Central American 12 Central American All 2017
8 Central American Male 9.3 Central American Male 2017
9 Central American Female 15 Central American Female 2017
10 Mexican 13.3 Mexican All 2017
# … with 22 more rows
Again, we would like to have NA values for any year/gender combinations that are missing. We will fill out the rest of the table with NA values by applying the pivot_wider() and pivot_longer() functions sequentially like so:
latinx_subgroups %<>%
select(-Group) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
               names_to = "Year",
               values_to = "Percent")
Click here to see how this interactive table was created
This table was created using the datatable() function of the DT package.
Like so:
OK, now let’s make sure that our notations match across our different tables. For example, in the first report the terms male and female were used, but in the second report men and women were used. Let’s make sure everything is consistent now.
# A tibble: 108 x 5
Group Race_Ethnicity Gender Year Percent
<chr> <chr> <chr> <dbl> <dbl>
1 United States All_races All 2008 12.6
2 United States All_races All 2010 14.7
3 United States All_races All 2012 14.1
4 United States All_races All 2014 13.2
5 United States All_races All 2016 11.7
6 United States All_races All 2017 11.5
7 Male All_races Male 2008 12.3
8 Male All_races Male 2010 15.2
9 Male All_races Male 2012 14.5
10 Male All_races Male 2014 13.3
# … with 98 more rows
# A tibble: 45 x 5
Group Percent Race_Ethnicity Gender Year
<chr> <dbl> <chr> <chr> <dbl>
1 United States 11.5 All_races All 2017
2 Male 11.8 All_races Male 2017
3 Female 11.1 All_races Female 2017
4 Asian 6.6 Asian All 2017
5 Asian Male 6.5 Asian Male 2017
6 Asian Female 6.7 Asian Female 2017
7 Chinese 4.3 Chinese All 2017
8 Chinese Male 4.7 Chinese Male 2017
9 Chinese Female 3.9 Chinese Female 2017
10 Vietnamese 5.5 Vietnamese All 2017
# … with 35 more rows
# A tibble: 32 x 5
Group Percent Race_Ethnicity Gender Year
<chr> <dbl> <chr> <chr> <dbl>
1 Latinx 13.2 Latinx All 2017
2 Latinx Male 12.4 Latinx Male 2017
3 Latinx Female 13.9 Latinx Female 2017
4 South American 8.4 South American All 2017
5 South American Male 9.1 South American Male 2017
6 South American Female 7.7 South American Female 2017
7 Central American 12 Central American All 2017
8 Central American Male 9.3 Central American Male 2017
9 Central American Female 15 Central American Female 2017
10 Mexican 13.3 Mexican All 2017
# … with 22 more rows
Looks good! If you made it this far, you should pat yourself on the back. That was a lot of wrangling!
Recall our main questions:
- How have youth disconnection rates in American youth changed since 2008?
- In particular, how has this changed for different gender and ethnic groups? Are any groups particularly disconnected?
Now that we have wrangled our data and made it easy to work with, let’s create some visualizations to explore these questions.
We are particularly interested in being able to reproduce the plot below from the report, as we would like to make similar looking plots.
In general, it is very useful to learn how to reproduce the style of a plot.
There are color identifying websites such as this.
Using one of these websites, we identify the hex triplet code for the color used in the visualization included in the PDF: #008393. Thus we will use this color in our plot.
We can create a version of the above plot using the ggplot2 package of the tidyverse.
Click here for an introduction to this package if you are new to using ggplot2
The ggplot2 package is generally intuitive for beginners because it is based on a grammar of graphics or the gg in ggplot2. The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. There are specific “words” that we will need to learn and once we do, you will be able to create (or “write”) hundreds of different plots.
The critical part of making graphics using ggplot2 is that the data needs to be in a tidy format. Given that we have just spent time putting our data in tidy format, we are primed to take advantage of all that ggplot2 has to offer!
We will show how it is easy to pipe tidy data (output) as input to other functions that create plots. This all works because we are working within the tidyverse.
What is the ggplot() function? As explained by Hadley Wickham:
The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinates system.
ggplot2 Terminology:
- aesthetics (specified with aes()): plot attributes such as the x and y variable names
- geoms (geometric objects): the layers that draw the data, e.g. geom_point(), geom_bar(), geom_line(), geom_histogram()

The function aes() is an aesthetic mapping function inside the ggplot() object. We use this function to specify plot attributes (e.g. x and y variable names) that will not change as we add more layers.
Anything that goes in the ggplot() object becomes a global setting. From there, we use the geom objects to add more layers to the base ggplot() object. These will define what we are interested in illustrating using the data.
So for this first plot we just want to recreate the plot in the report. We will use the major_groups_long tibble and we will filter for only the "All" values of the Gender column. We also want to exclude the data for the United States, which we can do using the not-equals operator != like so:
# A tibble: 30 x 5
Group Race_Ethnicity Gender Year Percent
<chr> <chr> <chr> <dbl> <dbl>
1 Asian Asian All 2008 7.1
2 Asian Asian All 2010 8.5
3 Asian Asian All 2012 7.8
4 Asian Asian All 2014 7.9
5 Asian Asian All 2016 6.6
6 Asian Asian All 2017 6.6
7 White White All 2008 9.7
8 White White All 2010 11.7
9 White White All 2012 11.2
10 White White All 2014 10.8
# … with 20 more rows
The data from the tables only included up to 2017, but this will still allow us to create a similar plot.
Now we have the data ready to make the plot… but how do we actually start making the plot?
First, we start with the ggplot() function of the ggplot2 package.
This function requires that the aesthetics aes() be specified. This involves choosing which variable will be plotted on the x-axis and which on the y-axis. It also involves choosing variables to color or group our plot by.
In our case, we want to plot the percent of disconnection on the y-axis (thus the Percent variable of the major_groups_long data) and the Year on the x-axis. We want to separate each racial or ethnic group to have their own points/lines. Thus we will use the color argument for this variable, as we intend to color our plot something other than black.
If we run the following code, we get an empty plot.
major_groups_long %>%
filter(Gender == "All",
Group != "United States") %>%
  ggplot(aes(x = Year, y = Percent, color = Race_Ethnicity))
The next thing we need to do is add ggplot2 layers using the + operator to specify how we want the data to be displayed on our plot.
We would like both points and lines. We will use the geom_point() and geom_line() functions of the ggplot2 package to do this. The layer we add first will be plotted first and so on with the next layers. We will also specify the size of these elements using the size argument.
major_groups_long %>%
filter(Gender == "All", Group != "United States") %>%
ggplot(aes(x = Year, y = Percent, color = Race_Ethnicity)) +
geom_line(size = 0.5) +
  geom_point(size = 3)
OK, not bad, but we have quite a bit to work on to make the plot style match.
First, we will update the x-axis and y-axis to look more similar to the plot from the report. We can specify the location of the tick marks using the breaks argument of the scale_x_continuous() and scale_y_continuous() functions (also of the ggplot2 package). These functions also allow specification of the range, or limits, of the axis using the limits argument. We can use the base seq() function to create a sequence of numbers for each tick mark. We will make the x-axis upper limit a bit larger to allow room for the figurine image and the label for each group.
Next, we will change the color to match the one that we identified, by using the scale_color_manual() function of the ggplot2 package. This requires color values for each group. In our case, we have 5 groups, so we repeat this value 5 times using the base rep() function.
We will also change the overall look of the plot using the theme_classic() function. See here for a list of options.
youth_plot <-
major_groups_long %>%
filter(Gender == "All", Group != "United States") %>%
ggplot(aes(x = Year, y = Percent, color = Race_Ethnicity)) +
geom_line( size = 0.5) +
geom_point( size = 3) +
scale_x_continuous(breaks = seq(2008,2018, by=1),
limits = c(2008,2020)) +
scale_y_continuous(breaks = seq(5,30, by =5),
limits = c(5,30)) +
scale_color_manual(values = c(rep( "#008393", 5))) +
theme_classic()
youth_plot
Now let’s add some labels. We can add a title using the title argument and a y-axis title using the y argument of the labs() function.
youth_plot <-
youth_plot +
labs(title = "YOUTH DISCONNECTION BY RACE AND ETHNICITY, 2008 - 2017",
y = "YOUTH DISCONNECTION (%)") +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank())
youth_plot <-
directlabels::direct.label(youth_plot,
list(dl.trans(x = x+1.2, y = y + 0),
"last.points"))
youth_plot
Getting close!
Now we need to add figurine icons to the plot.
To add the figurines we can use icons from free resources like font awesome. See here for a link to svg options. Alternatively, we could take a screenshot of just the figurine from the original plot. However, it is useful to know how to create similar icons when we don’t have the original plot to work with, so we will start with an example using a font awesome icon.
First we need to import the image from font awesome:
fa_figurine <- image_read("https://upload.wikimedia.org/wikipedia/commons/7/7c/User_font_awesome.svg")
fa_figurine
We would like to change the color of the icon to match that of our plot. We can do so using the image_fill() function of the magick package. We need to list the name of the image object to change, then the color we want to use. Finally, we need to list which point we want to start coloring from. We don’t want to color the space around the black shapes, just the black shapes themselves. It may take some trial and error to figure out where these pixels are.
According to the documentation for the magick package:
“The image_fill() function performs flood-fill by painting starting point and all neighboring pixels of approximately the same color.”
“point = a geometry_point string indicating the starting point of the flood-fill”
First we will color in the circle at the top of the icon. The point argument specifies the starting pixel; for example, point = "+600+600" starts at the pixel at x:600, y:600. We can see from the output about fa_figurine that the image is 1792 by 1792 pixels in size. Thus, if we choose a pixel within the borders of the shape we want to change, we can recolor that shape. We can use the size of the image to estimate where such a pixel might be: since the image is 1792 by 1792 pixels, x:800 is about half way across the image from the left side (x:0), and y:600 is about a third of the way down from the top (y:0).
Second, we will color in the lower shape of the icon. This shape is in the lower half of the image and centered along the x-axis, so we will start at a pixel about half way across the width (x:800) and a bit more than half way down the height (y:1000).
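A sketch of those two fill steps (the exact starting points and the fuzz tolerance are assumptions and may need trial and error):

```r
library(magick)

# Recolor the head (circle) and body shapes of the icon to the report teal.
# point gives the starting pixel for the flood fill; fuzz allows nearby shades
# to be treated as the same color.
fa_figurine <- image_fill(fa_figurine, color = "#008393",
                          point = "+800+600", fuzz = 20)
fa_figurine <- image_fill(fa_figurine, color = "#008393",
                          point = "+800+1000", fuzz = 20)
fa_figurine
```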
Nice! We changed the color of the image.
Now let’s try using a screenshot of the original plot.
Now to add the image of the figurine to the plot we can use the draw_image() function of the cowplot package. We simply need to specify the name of the image, and then where we would like it to go according to the x- and y-axis and finally the scaled size of the image. This scale argument also takes a bit of trial and error.
Here we will add both images of the figurines to a couple of groups:
youth_plot +
cowplot::draw_image(figurine, x = 2017, y = 23.5, scale = 4) +
  cowplot::draw_image(fa_figurine, x = 2017, y = 17.5, scale = 2)
Nice!
Now we just need to add this for all of the groups, and we will also want to replace our Native American label with a two-line version so that our plot looks as similar to the report plot as possible.
Question Opportunity
How might we change the Native American label to have two lines? (hint: think about how we separated the rows of our tables)
Click here to reveal the code.
youth_plot <- major_groups_long %>%
filter(Gender == "All",
Group != "United States") %>%
mutate(Race_Ethnicity = str_replace(string = Race_Ethnicity,
pattern = "Native American",
replacement = "Native\nAmerican")) %>%
ggplot(aes(x = Year, y = Percent, color = Race_Ethnicity)) +
geom_line( size = 0.5) +
geom_point( size = 3) +
scale_x_continuous(breaks = seq(2008,2018, by=1),
limits = c(2008,2019)) +
scale_y_continuous(breaks = seq(5,30, by =5),
limits = c(5,30)) +
theme_classic() +
labs(title = "YOUTH DISCONNECTION BY RACE AND ETHNICITY, 2008 - 2017",
y = "YOUTH DISCONNECTION (%)") +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank())+
scale_color_manual(values = c(rep( "#008393", 5)))+
draw_image(figurine, x = 2017.25, y = 23.5, scale = 4) +
draw_image(figurine, x = 2017.25, y = 17.5, scale = 4) +
draw_image(figurine, x = 2017.25, y = 13.5, scale = 4) +
draw_image(figurine, x = 2017.25, y = 9, scale = 4) +
draw_image(figurine, x = 2017.25, y = 5.5, scale = 4)
youth_plot <- directlabels::direct.label(youth_plot, list(dl.trans(x = x + 1.4,
y = y + 0),
                                                     "last.points"))
Nice! Our plot looks very similar stylistically to the plot in the report.
We can see from this plot that Native Americans have a very high rate of youth disconnection, with still roughly a quarter of this population experiencing youth disconnection in this survey. Although some groups, like the Latinx group, show a downward trend, levels have been fairly stable since 2008, as the slope of each line is fairly flat. We will investigate this further in our analysis.
Now let’s create a similar plot for females and males. To do this we can simply use the facet_wrap() function. We need to specify what variable we want to facet by, in this case the Gender variable, by using the ~ operator. We also want to set the scales argument to "free", so that we have a y-axis for both the female and male plots. See this case study for more information about faceting plots.
We do not want to plot the "All" values for the Gender variable as we have already shown that in the previous plot, so we again exclude it.
Question Opportunity
How do we modify the plot above to not include the "All" values for the Gender variable and stratify by Gender?
Click here to reveal the code.
plot_by_gender <- major_groups_long %>%
filter(Race_Ethnicity != "All_races", Gender != "All") %>%
mutate(Race_Ethnicity = str_replace(string = Race_Ethnicity,
pattern = "Native American",
replacement = "Native\nAmerican"),
Gender = str_replace(string = Gender,
pattern = "Female",
replacement = "FEMALE"),
Gender = str_replace(string = Gender,
pattern = "Male",
replacement = "MALE")) %>%
ggplot(aes(x = Year, y = Percent, color = Race_Ethnicity)) +
geom_line(size = 0.5) +
geom_point(size = 3) +
facet_wrap(~Gender, scales = "free") +
scale_x_continuous(breaks = seq(2008,2018, by=1),
limits = c(2008,2020)) +
scale_y_continuous(breaks = seq(5,30, by =5),
limits = c(5,31)) +
theme_classic() +
labs(title = "YOUTH DISCONNECTION BY GENDER AND RACIAL/ETHNIC GROUP",
y = "YOUTH DISCONNECTION (%)") +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90),
strip.background =element_rect(fill="#003661"),
strip.text = element_text(colour = 'white', face = "bold",
size = 12)) +
scale_color_manual(values = c(rep( "#008393", 6)))
plot_by_gender <- direct.label(plot_by_gender, list(dl.trans(x = x + 0.3, y = y + 0), "last.points"))
It’s a bit difficult to tell the Black female and Latinx female lines apart.
Let’s try adding another color for every other line. Again we will use the scale_color_manual() function to specify the colors of the lines and points on the plot.
We will use black as the other color. We can use the hex triplet code or we can write out the name of the color; in this case we will write "black". To color each group differently, we write the colors in the alphabetical order of the distinct values of Race_Ethnicity. Thus the first color listed will be the color for the Asian values, while the last will be the color for the White values.
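The chunk itself was not carried over; a sketch (the exact color vector is an assumption, alternating the report teal with black across the alphabetically ordered levels; the original likely rebuilds the full plot with this scale in place of the all-teal one):

```r
library(ggplot2)

# Assumption: alternate teal and black across six Race_Ethnicity levels
plot_by_gender <- plot_by_gender +
  scale_color_manual(values = rep(c("#008393", "black"), length.out = 6))
plot_by_gender
```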
Question Opportunity
Can you explain why we didn’t use a pipe in the previous code chunk?
Very interesting! By parsing the groups further into intersections of racial and ethnic groups with gender, we see that there are some very striking differences for the Black and Latinx groups when stratified by Gender. One positive thing that we can see, is that there has been a steep decline in youth disconnection for Latinx females. Unfortunately, the level of disconnection for females in 2008 was quite high, so now Latinx females have similar rates to Latinx males. In contrast, we see that Black males have had and continue to have much higher rates of disconnection than Black females. Both Black females and males show an increased rate since 2016. Native American females also show an increased rate since 2016.
If you have time, the next section will show how to make bar plots to look for differences in subgroups for the Asian and Latinx data. Otherwise, you can skip ahead to the Data Analysis section.
It is clear from the previous plot that observing more specific subgroups in our data can be very informative!
Thus, we also want to make plots of the Asian and Latinx subgroups, to see if there are particular ethnic groups that have higher levels of youth disconnection. If so, these groups may particularly benefit from prevention and re-engagement efforts.
We will also attempt to continue to plot the genders separately, as we have learned that there may be important gender group differences among the racial and ethnic groups. However, the data is incomplete for some of the ethnic groups. Recall that we also only have two years of data for both our Asian and Latinx subgroup data. First we will start by plotting just the subgroups over the two years.
We can continue to make our plots look like they match the report by using a color palette based on the color used in the report. We can use the colorRampPalette() function of the grDevices package, which is loaded automatically in an R session. We can specify that we want the colors to range from gray to the color that we have been using.
Then we just need to specify how many colors we want in our gradient when we choose to use the color palette. Say we wanted 4 colors, we just need to type custom_pal(4) to get a list of 4 colors within the palette. Thus we will have gray, the teal color we have been using, and two shades in between. After doing this we chose to use one of the colors in between and to use just two colors for our palette.
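The palette code was not carried over; a sketch (assuming the endpoints are R’s built-in "gray" and the report teal, and that the final two-color palette keeps the middle shade and the teal, per the text above):

```r
# colorRampPalette() returns a function; calling it with n returns n colors
custom_pal <- colorRampPalette(c("gray", "#008393"))
custom_pal(4)

# Assumption: keep one in-between shade plus the teal as the final palette
custom_pal <- c("#7EAAAF", "#008393")
```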
[1] "#BEBEBE" "#7EAAAF" "#3F96A1" "#008393"
Here we can see the hex triplet codes for gray and the shade in between.
OK, now let’s start with the Asian subgroups. Since we only have two time points of data for each group, we will create a bar graph this time where the two years are displayed next to each other for each subgroup. Thus, we want the Year variable to be a discrete variable. To ensure that it is interpreted as a discrete variable, we need to convert it to a character type variable rather than numeric using the base as.character() function.
First, we want to filter for the rows where the Gender variable value is All; we also do not want to include the data for the average of all races.
asian_subgroups %>%
filter(Gender == "All") %>%
filter(Race_Ethnicity != "All_races") %>%
  mutate(Year = as.character(Year))
# A tibble: 17 x 5
Group Percent Race_Ethnicity Gender Year
<chr> <dbl> <chr> <chr> <chr>
1 Asian 6.6 Asian All 2017
2 Chinese 4.3 Chinese All 2017
3 Vietnamese 5.5 Vietnamese All 2017
4 Indian 5.9 Indian All 2017
5 Pakistani 6.4 Pakistani All 2017
6 Korean 6.5 Korean All 2017
7 Two Or More 6.6 Two Or More All 2017
8 Filipino 7.3 Filipino All 2017
9 Hmong 14 Hmong All 2017
10 Chinese 4.1 Chinese All 2018
11 Indian 5.4 Indian All 2018
12 Korean 5.5 Korean All 2018
13 Vietnamese 6.3 Vietnamese All 2018
14 Filipino 6.8 Filipino All 2018
15 Hmong 10.2 Hmong All 2018
16 Cambodian 13.8 Cambodian All 2018
17 Asian 6.2 Asian All 2018
Now that we have wrangled the data we can make our plot. To make a bar plot there are two main ggplot2 functions, geom_col() and geom_bar(). The geom_col() function plots the actual values of the data, while the geom_bar() function plots counts (however, you can override this with the stat = "identity" argument). We are interested in plotting the actual values, so we will use the geom_col() function.
This time we will specify that the Year variable be used to specify the fill color of the bars of the bar plot by using aes(fill = Year).
We also need to indicate how the bars should be plotted based on the position argument.
The options for the position argument are:
- stack - the years would be displayed on top of one another (this is the default)
- dodge - the years would be displayed next to one another with no space in between
- dodge2 - the years would be displayed next to one another with a space in between
- fill - the years would be displayed on top of one another, where the heights of each color show the relative proportions for each year adding up to 1; thus each bar would have the same height

We will use the dodge option.
For the colors, we can try out various numbers of shades for the palette until we get shades that we like. In this case, 2 shades out of a gradient of 4 shades from gray to teal looks nice.
asian_subgroups %>%
filter(Gender == "All", Race_Ethnicity != "All_races") %>%
mutate(Year = as.character(Year)) %>%
ggplot(aes(x = Race_Ethnicity, y = Percent)) +
geom_col(aes(fill = Year), position = "dodge")+
labs(title = "YOUTH DISCONNECTION BY ASIAN SUBGROUP, 2017-2018",
y = "YOUTH DISCONNECTION (%)") +
scale_fill_manual(values = custom_pal) +
theme_classic() +
theme(title = element_text(size = 10, color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1))
OK, this looks pretty nice! However, we can improve this a bit.
Before we do so, let’s create a theme for our future similar plots like so:
bar_theme <- function() {
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1))
}
Now we simply need to type bar_theme() instead to achieve the same style for our plot.
It would be nice if, for subgroups that only have one year of data, the displayed column were still the same width as those of the other groups. To do this we need a row with an NA value for the year that is missing. One way to accomplish this is to use the pivot_wider() and pivot_longer() functions to widen the data based on the Year variable and then collapse it back. By widening our data, we create the NA values we want; however, we need to collapse our data back to the long format so that we can easily use ggplot2 to plot the year as the fill in our bar plot.
Note that we no longer need to change the type of the Year variable, as it is automatically converted to type character.
asian_subgroups %>%
filter(Gender == "All", Race_Ethnicity != "All_races") %>%
dplyr::select(-Group) %>%
  pivot_wider(names_from = Year, values_from = Percent)
# A tibble: 10 x 4
Race_Ethnicity Gender `2017` `2018`
<chr> <chr> <dbl> <dbl>
1 Asian All 6.6 6.2
2 Chinese All 4.3 4.1
3 Vietnamese All 5.5 6.3
4 Indian All 5.9 5.4
5 Pakistani All 6.4 NA
6 Korean All 6.5 5.5
7 Two Or More All 6.6 NA
8 Filipino All 7.3 6.8
9 Hmong All 14 10.2
10 Cambodian All NA 13.8
Great! Now we have NA values. Now we just need to get the data back into long format:
asian_subgroups %>%
filter(Gender == "All", Race_Ethnicity != "All_races") %>%
dplyr::select(-Group) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
               names_to = "Year",
               values_to = "Percent")
# A tibble: 20 x 4
Race_Ethnicity Gender Year Percent
<chr> <chr> <chr> <dbl>
1 Asian All 2017 6.6
2 Asian All 2018 6.2
3 Chinese All 2017 4.3
4 Chinese All 2018 4.1
5 Vietnamese All 2017 5.5
6 Vietnamese All 2018 6.3
7 Indian All 2017 5.9
8 Indian All 2018 5.4
9 Pakistani All 2017 6.4
10 Pakistani All 2018 NA
11 Korean All 2017 6.5
12 Korean All 2018 5.5
13 Two Or More All 2017 6.6
14 Two Or More All 2018 NA
15 Filipino All 2017 7.3
16 Filipino All 2018 6.8
17 Hmong All 2017 14
18 Hmong All 2018 10.2
19 Cambodian All 2017 NA
20 Cambodian All 2018 13.8
Great, now let’s see how this looks in the plot.
asian_subgroups %>%
filter(Gender == "All", Race_Ethnicity != "All_races") %>%
dplyr::select(-Group) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
names_to ="Year" ,
values_to="Percent") %>%
ggplot(aes(x = Race_Ethnicity, y = Percent)) +
geom_col(aes(fill = Year), position = "dodge")+
labs(title = "YOUTH DISCONNECTION BY ASIAN SUBGROUP, 2017-2018",
y = "YOUTH DISCONNECTION (%)") +
scale_fill_manual(values = custom_pal) +
  bar_theme()
Great! That is looking better! However, currently the subgroups are plotted on the x-axis in alphabetical order, and we want instead to order the subgroups based on the percentage of youth disconnection. We can use the forcats package to do this.
The fct_reorder() function can be used to order the Race_Ethnicity variable based on the average of the Percent variable for the two years, while the fct_relevel() function can be used to make the Asian level appear first for comparison’s sake. Importantly, we need to do this before we reshape the data, because subgroups with NA values will otherwise be placed at the end.
asian_subgroups %>%
filter(Gender == "All", Race_Ethnicity != "All_races") %>%
mutate(Race_Ethnicity = fct_reorder(Race_Ethnicity, Percent),
Race_Ethnicity = fct_relevel(Race_Ethnicity, "Asian")) %>%
select(-Group) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
names_to ="Year", values_to="Percent")%>%
ggplot(aes(x = Race_Ethnicity, y = Percent)) +
geom_col(aes(fill = Year), position = "dodge")+
labs(title = "YOUTH DISCONNECTION BY ASIAN SUBGROUP",
subtitle = "ORDERED BY AVERAGE DISCONNECTION LEVEL OF 2017 & 2018",
y = "YOUTH DISCONNECTION (%)") +
scale_fill_manual(values = custom_pal) +
  bar_theme()
This is looking very good! Now, let’s make a gap between the Asian average group and the subgroups to more easily differentiate between the two, and let’s use that space to add the figure legend to the plot itself.
To first make a gap, we can use the scale_x_discrete() function. We need to write the name of the levels of the Race_Ethnicity variable and place a blank in between the Asian level and the subsequent levels. We can use the levels() function to make this more reproducible and to avoid mistakes instead of actually writing out each of the levels by hand.
Here you can see all the levels:
asian_subgroups %>%
dplyr::select(-Group) %>%
filter(Gender == "All", Race_Ethnicity != "All_races") %>%
mutate(Race_Ethnicity = fct_reorder(Race_Ethnicity, Percent),
Race_Ethnicity = fct_relevel(Race_Ethnicity, "Asian")) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
names_to = "Year", values_to = "Percent")%>%
pull(Race_Ethnicity) %>%
levels()
[1] "Asian" "Chinese" "Indian" "Vietnamese" "Korean"
[6] "Pakistani" "Two Or More" "Filipino" "Hmong" "Cambodian"
To simplify our code we will save our wrangled data as a tibble called df_asian_subgroup.
So we would like our x-axis to be like this:
scale_x_discrete(
limits = c("Asian",
"Blank",
"NEXT_blank",
levels(pull(df_asian_subgroup, Race_Ethnicity))[2:10]),
labels = c("Blank" = "",
"NEXT_blank"= ""))The levels(pull(df_asian_subgroup, Race_Ethnicity))[2:10]) code gets all 9 other levels but the Asian level of the Race_Ethnicity variable. The labels argument specifies what the x-axis should say for the spaces in between. We would like them to say nothing, so we use nothing within quotes "" as the label.
We can also use this as an opportunity to change the label on the plot for the Asian level of the Race_Ethnicity variable to be "Asian avg.". We don’t want to change the other subgroup level labels so we will not list anything.
We can make this more reproducible by using the length() function instead of the number 10 and replacing Asian with the first level:
scale_x_discrete(
limits = c(levels(pull(df_asian_subgroup, Race_Ethnicity))[1],
"Blank",
"NEXT_blank",
levels(pull(df_asian_subgroup, Race_Ethnicity))[2:
length(levels(pull(df_asian_subgroup, Race_Ethnicity)))]),
labels = c("Asian" = "Asian Avg.",
"Blank" = "",
"NEXT_blank" = ""))
Then we can use the legend.justification and legend.position arguments of the theme() function to move our legend into the plot area.
According to the documentation for the theme() function of the ggplot2 package, the legend.justification argument specifies the:
anchor point for positioning legend inside plot (“center” or two-element numeric vector) or the justification according to the plot area when positioned outside the plot
By setting both legend.position and legend.justification to c(0.12, 0.1), we specify that we want the legend placed 12 percent of the plot area from the y-axis and 10 percent of the plot area from the x-axis.
df_asian_subgroup <-
asian_subgroups %>%
dplyr::select(-Group) %>%
filter(Gender == "All",
Race_Ethnicity != "All_races") %>%
mutate(Race_Ethnicity = fct_reorder(Race_Ethnicity, Percent),
Race_Ethnicity = fct_relevel(Race_Ethnicity, "Asian")) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
names_to = "Year", values_to = "Percent")
asian_subgroup_plot <-
df_asian_subgroup %>%
ggplot(aes(x = Race_Ethnicity, y = Percent, fill = Year)) +
geom_col(aes(fill = Year), position = "dodge")+
labs(title = "YOUTH DISCONNECTION BY ASIAN SUBGROUP",
subtitle = "ORDERED BY AVERAGE DISCONNECTION LEVEL OF 2017 & 2018",
y = "YOUTH DISCONNECTION (%)") +
scale_fill_manual(values = custom_pal) +
scale_x_discrete(
limits = c(levels(pull(df_asian_subgroup,Race_Ethnicity))[1],
"Blank",
"NEXT_blank",
levels(pull(df_asian_subgroup,Race_Ethnicity))[2:
length(levels(pull(df_asian_subgroup,Race_Ethnicity)))]),
labels = c("Asian" = "Asian Avg.", "Blank" = "", "NEXT_blank"= "")) +
scale_y_continuous(limits = c(0,max(pull(asian_subgroups,Percent),
na.rm = TRUE)))+
bar_theme() +
# this adds the legend to the plot area!:
theme(legend.justification = c(0.12,0.1),
legend.position = c(0.12,.1))
asian_subgroup_plot
Nice! But this is still missing something. Perhaps if we add lines that show the average disconnection rate for the US and for Asians overall, it will be easier to see differences in our plot.
In the previous code we saved our plot to an object called asian_subgroup_plot.
We can modify it by adding layers to add the lines that we want.
First, we want to create tibbles for the US and Asian average disconnection rates. We want to add labels to each of these. We can use the parse = TRUE argument later with the geom_text() function of the ggplot2 package to make these labels appear in bold font, as each label will be evaluated as an expression instead of a simple character string.
df_asian_total <-
asian_subgroups %>%
filter(Year == 2017, Gender == "All", Race_Ethnicity == "Asian") %>%
mutate(Year = as.character(Year),
label = c('bold("ASIAN 2017 AVG RATE")'))
df_US_total <-
asian_subgroups %>%
filter(Year == 2017, Gender == "All", Race_Ethnicity == "All_races") %>%
mutate(Year = as.character(Year), label = c('bold("US 2017 AVG RATE")'))
df_asian_total
# A tibble: 1 x 6
Group Percent Race_Ethnicity Gender Year label
<chr> <dbl> <chr> <chr> <chr> <chr>
1 Asian 6.6 Asian All 2017 "bold(\"ASIAN 2017 AVG RATE\")"
df_US_total
# A tibble: 1 x 6
Group Percent Race_Ethnicity Gender Year label
<chr> <dbl> <chr> <chr> <chr> <chr>
1 United States 11.5 All_races All 2017 "bold(\"US 2017 AVG RATE\")"
Now we can use these to create labels again with the geom_text() function and lines with the geom_hline() function on our plot.
The geom_text() function requires that the x and y locations for the text be specified, and we want the text to be located near the actual percent disconnection rate; thus we use Percent + .7 to grab the Percent value from df_asian_total and add 0.7 to it so the label sits slightly above where the line will be.
The geom_hline() function requires just the y-intercept, as this function always creates a horizontal line. The linetype argument specifies what style of line we would like. See here for a list of options.
asian_subgroup_plot <-
asian_subgroup_plot +
geom_text(data = df_asian_total,
mapping = aes(x = 2.5, y = Percent +.7, label = label),
parse = TRUE,
color = "#008393",
size = 4) +
geom_hline(aes(yintercept = pull(df_asian_total, Percent)), linetype = 2) +
geom_text(data = df_US_total,
mapping = aes(x = 2.5, y = Percent +.7, label = label),
parse = TRUE,
color = "#008393",
size = 4) +
geom_hline(aes(yintercept = pull(df_US_total, Percent)), linetype = 3)
asian_subgroup_plot
Awesome! Now it is much easier to tell how the disconnection rates for the various subgroups compare to the national average and the Asian average. From our plot, we can see that the Hmong and Cambodian subgroup rates are very high. If we only had the plot of major racial and ethnic groups to rely on, we might not realize that these two groups have such high rates.
Nice! This shows the importance of adding small details such as the US and Asian rate lines. This helps provide a simple yet nuanced picture of what is going on.
From the above plot, it becomes readily apparent that the Hmong and Cambodian subgroups have much higher disconnection rates than the other Asian subgroups, as well as all races / ethnicities in the US combined.
If we wanted to make a similar plot without the subgroups that are missing a year, we could add the drop_na() function of the tidyr package to remove rows with NA values while the data is in wide form, and then use the fct_drop() function of the forcats package to remove the levels of the Race_Ethnicity variable that no longer have rows in the data.
Click here to reveal the code.
df_asian_subgroup <-
asian_subgroups %>%
dplyr::select(-Group) %>%
filter(Gender == "All", Race_Ethnicity != "All_races") %>%
mutate(Race_Ethnicity = fct_reorder(Race_Ethnicity, Percent),
Race_Ethnicity = fct_relevel(Race_Ethnicity, "Asian")) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
tidyr::drop_na() %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
names_to = "Year", values_to = "Percent") %>%
mutate(Race_Ethnicity = fct_drop(Race_Ethnicity))
asian_subgroup_plot_simp <-
df_asian_subgroup %>%
ggplot(aes(x = Race_Ethnicity, y = Percent, fill = Year)) +
geom_col(aes(fill = Year), position = "dodge")+
labs(title = "YOUTH DISCONNECTION BY ASIAN SUBGROUP",
subtitle = "ORDERED BY AVERAGE DISCONNECTION LEVEL OF 2017 & 2018",
y = "YOUTH DISCONNECTION (%)") +
scale_fill_manual(values = custom_pal) +
scale_x_discrete(
limits = c(levels(pull(df_asian_subgroup,Race_Ethnicity))[1],
"Blank",
"NEXT_blank",
levels(pull(df_asian_subgroup,Race_Ethnicity))[2:
length(levels(pull(df_asian_subgroup,Race_Ethnicity)))]),
labels = c("Asian" = "Asian Avg.",
"Blank" = "",
"NEXT_blank"= ""))+
scale_y_continuous(limits = c(0,max(pull(asian_subgroups,Percent),
na.rm = TRUE)))+
bar_theme() +
# this adds the legend to the plot area!:
theme(legend.justification = c(0.16, 0.01), legend.position = c(0.16, 0.01)) +
geom_text(data = df_asian_total,
mapping = aes(x = 2.5, y = Percent +.7, label = label),
parse = TRUE, color = "#008393", size = 4) +
geom_hline(aes(yintercept = pull(df_asian_total, Percent)), linetype = 2) +
geom_text(data = df_US_total,
mapping = aes(x = 2.5, y = Percent +.7, label = label),
parse = TRUE, color = "#008393", size = 4) +
geom_hline(aes(yintercept = pull(df_US_total, Percent)), linetype = 3)
Now let's do the same for the Latinx subgroups.
First let’s make the small tibble of the 2017 Latinx average rate like we did for the US and for Asians to create the lines and text on our plot.
df_latinx_total <-
latinx_subgroups %>%
filter(Year == 2017, Gender == "All", Race_Ethnicity == "Latinx") %>%
mutate(Year = as.character(Year),
label = c('bold("LATINX 2017 AVG RATE")'))
Now we will use similar code for the Latinx subgroups. The only difference is that we will also change the "Puerto Rican, Dominican, Cuban" values to have new line breaks between the groups. First let's plot all subgroups.
Click here to reveal the code.
latinx_for_plot <-latinx_subgroups %>%
dplyr::select(-Group) %>%
filter(Gender == "All", Race_Ethnicity != "All_races") %>%
mutate(Race_Ethnicity = str_replace(
string = Race_Ethnicity,
pattern = "Puerto Rican, Dominican, Cuban",
replacement = "Puerto Rican,\nDominican,\nCuban")) %>%
mutate(Race_Ethnicity = fct_reorder(Race_Ethnicity, Percent),
Race_Ethnicity = fct_relevel(Race_Ethnicity, "Latinx")) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
names_to = "Year" ,
values_to = "Percent")
latinx_subgroup_plot <-
latinx_for_plot %>%
ggplot(aes(x = Race_Ethnicity, y = Percent, fill = Year)) +
geom_col(aes(fill = Year), position = "dodge") +
labs(title = "YOUTH DISCONNECTION BY LATINX SUBGROUP",
subtitle = "ORDERED BY AVERAGE DISCONNECTION LEVEL OF 2017 & 2018",
y = "YOUTH DISCONNECTION (%)") +
scale_fill_manual(values = custom_pal) +
scale_x_discrete(
limits = c(levels(pull(latinx_for_plot,Race_Ethnicity))[1],
"Blank",
"NEXT_blank",
levels(pull(latinx_for_plot,Race_Ethnicity))[2:
length(levels(pull(latinx_for_plot,Race_Ethnicity)))]),
labels = c("Latinx" = "Latinx Avg.",
"Blank" = "",
"NEXT_blank"= ""))+
bar_theme() +
# this adds the legend to the plot area!:
theme(legend.justification = c(0.23, 0.1), legend.position = c(0.23, 0.1)) +
geom_text(data = df_latinx_total,
mapping = aes(x = 3, y = Percent +.7, label = label),
parse = TRUE, color = "#008393", size = 4) +
geom_hline(aes(yintercept = pull(df_latinx_total, Percent)), linetype = 2) +
geom_text(data = df_US_total,
mapping = aes(x = 3, y = Percent +.7, label = label),
parse = TRUE, color = "#008393", size = 4) +
geom_hline(aes(yintercept = pull(df_US_total, Percent)), linetype = 3)
Now let's drop the subgroups that don't have values for both years.
Click here to reveal the code.
latinx_for_plot <-latinx_subgroups %>%
dplyr::select(-Group) %>%
filter(Gender == "All") %>%
filter(Race_Ethnicity != "All_races") %>%
mutate(Race_Ethnicity = str_replace(
string = Race_Ethnicity,
pattern = "Puerto Rican, Dominican, Cuban",
replacement = "Puerto Rican,\nDominican,\nCuban")) %>%
mutate(Race_Ethnicity = fct_reorder(Race_Ethnicity, Percent)) %>%
mutate(Race_Ethnicity = fct_relevel(Race_Ethnicity, "Latinx")) %>%
pivot_wider(names_from = Year, values_from = Percent) %>%
tidyr::drop_na() %>%
pivot_longer(cols = -c(Race_Ethnicity, Gender),
names_to = "Year" ,
values_to = "Percent")%>%
mutate(Race_Ethnicity = fct_drop(Race_Ethnicity))
latinx_subgroup_plot_simp <-latinx_for_plot %>%
ggplot(aes(x = Race_Ethnicity, y = Percent, fill = Year)) +
geom_col(aes(fill = Year), position = "dodge")+
labs(title = "YOUTH DISCONNECTION BY LATINX SUBGROUP",
subtitle = "ORDERED BY AVERAGE DISCONNECTION LEVEL OF 2017 & 2018",
y = "YOUTH DISCONNECTION (%)") +
scale_fill_manual(values = custom_pal) +
scale_x_discrete(
limits = c(levels(pull(latinx_for_plot,Race_Ethnicity))[1],
"Blank",
"NEXT_blank",
levels(pull(latinx_for_plot,Race_Ethnicity))[2:
length(levels(pull(latinx_for_plot,Race_Ethnicity)))]),
labels = c("Latinx" = "Latinx Avg.",
"Blank" = "",
"NEXT_blank"= ""))+
scale_y_continuous(limits = c(0,max(pull(latinx_subgroups,Percent),
na.rm = TRUE)))+
bar_theme()+
# this adds the legend to the plot area!:
theme(legend.justification = c(0.23, 0.01), legend.position = c(0.23, 0.01)) +
geom_text(data = df_latinx_total,
mapping = aes(x = 3, y = Percent +.7, label = label),
parse = TRUE,
color = "#008393",
size = 4) +
geom_hline(aes(yintercept = pull(df_latinx_total, Percent)), linetype = 2) +
geom_text(data = df_US_total,
mapping = aes(x = 3, y = Percent +.7, label = label),
parse = TRUE,
color = "#008393",
size = 4) +
geom_hline(aes(yintercept = pull(df_US_total, Percent)), linetype = 3)
In our data, we have pooled (repeated) cross-sectional data.
This is data produced from repeated measurement of a population over time.
It is often unfeasible to collect data for an entire population at once. However, we can still obtain meaningful measures using a random sample of the population.
At specific time-points, data is collected from a sample of the population. The individuals in each sample are not necessarily the same individuals. This separates pooled cross-sectional data from panel data, which is longitudinal data from repeated measurement of the same people.
By sampling from a population at multiple time points, we can generate estimates of population-level statistics. Although these statistics have some random error, they can provide insight into how the measured variable is changing in a population over time.
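The idea can be sketched with a tiny base R simulation (the population and the disconnection rate here are made up purely for illustration):

```r
set.seed(1)

# hypothetical population of one million youths: 1 = disconnected, 0 = not,
# with a "true" disconnection rate of 12%
population <- rbinom(1e6, 1, 0.12)

# at one time point, estimate the rate from a random sample of 2,000 youths
mean(sample(population, 2000))
# close to 0.12, but with some random sampling error
```

Repeating the sampling at several time points gives a series of such estimates, which is what the Measure of America reports provide.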
As we have measured values over time, we can plot the data to see if there is a trend (increasing, decreasing, monotonic, non-monotonic, linear, non-linear, etc). Sometimes, however, the trend isn’t exactly clear. Fortunately, there are statistical methods to resolve this issue.
Next, we will introduce and discuss the Mann-Kendall trend test.
The Mann-Kendall trend test, (which is a variation of the Kendall rank correlation coefficient), tests whether there is a monotonic association between two variables, or a relationship with consistent direction between two variables, where one variable (typically time) determines the order of the other variable. Thus this test is most often used to see if there is a consistent change in direction (or mathematical sign) for a variable across time.
As you can see in the image below, a monotonic relationship does not need to be linear, it simply needs to show a consistent direction of change. A linear relationship on the other hand shows consistent direction and rate of change.
So in other words, we are interested to see if there is a consistent direction in the relationship of disconnection rates with time, as we want to know if rates have been consistently increasing or decreasing or if they have been staying roughly the same.
Note: this test does not test if the rate of change has been consistent over time, just the direction.
We will briefly compare the Mann-Kendall trend test to simple linear regression.
There are a couple of differences between the two:
Linear regression assesses if there is a consistent change in both rate and direction of the dependent variable as a single independent variable changes. The coefficients tell us how much the dependent variable changes with a one-unit change of the independent variable.
The Mann-Kendall trend test is a nonparametric test, which means that it does not require as many assumptions as some parametric tests, like the simple linear regression.
The null hypothesis \(H_{0}\) of the test is that there is no monotonic trend, while the alternative hypothesis \(H_{a}\) is that there is a monotonic trend.
According to this report, the Mann-Kendall \(S\) score is calculated as follows:
The data values are evaluated as an ordered time series. Each data value is compared to all subsequent data values. The initial value of the Mann-Kendall statistic, \(S\), is assumed to be 0 (e.g., no trend). If a data value from a later time period is higher than a data value from an earlier time period, \(S\) is incremented by 1. On the other hand, if the data value from a later time period is lower than a data value sampled earlier, \(S\) is decremented by 1. The net result of all such increments and decrements yields the final value of \(S\).
\[S =\sum_{k=1}^{n-1}\sum_{j = k+1}^{n} sign(x_{j} - x_{k}) \]
where we assume there are \(n\) time points and observations \(x_{j}\) and \(x_{k}\) at two time points and:
\[sign(x_{j}-x_{k}) = 1 \text{, if } x_{j}-x_{k}>0\] \[sign(x_{j}-x_{k}) = 0 \text{, if } x_{j}-x_{k}=0\] \[sign(x_{j}-x_{k}) = -1 \text{, if } x_{j}-x_{k}<0\]
Another way of stating this is that every value occurring later in time must be compared with every value occurring earlier in time, giving \(n(n-1)/2\) comparisons. For example, with four time points (\(n = 4\)):
\[S = sign(x_2−x_1) + sign(x_3−x_1) + sign(x_4-x_1) + sign(x_3−x_2) + sign(x_4-x_2) + sign(x_4-x_3)\]
which is consistent with \(4(4-1)/2 = 4(3)/2 = 12/2 = 6\) comparisons.
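The double sum above can be written directly in base R (a minimal sketch; mk_s is just an illustrative helper name, not a function from any package):

```r
# Mann-Kendall S score: sum of sign(x_j - x_k) over all pairs with j > k
mk_s <- function(x) {
  n <- length(x)
  s <- 0
  for (k in 1:(n - 1)) {
    for (j in (k + 1):n) {
      s <- s + sign(x[j] - x[k])
    }
  }
  s
}

# a strictly increasing series: every one of the 4(4 - 1)/2 = 6 signs is +1
mk_s(c(1, 2, 3, 4))
# [1] 6
```

A strictly decreasing series of the same length would give S = -6, and a series with no trend gives an S near 0.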
Once we have calculated the Mann-Kendall \(S\) score, we want to know whether the score is meaningfully larger or smaller than what we would expect from data with no trend. To determine that, we need to know the variance of the Mann-Kendall \(S\) score (or how much variability we expect).
We won’t explain the details here, but it turns out the variance of the Mann-Kendall \(S\) score (or \(\text{var}(S)\)) is calculated like so:
\[ \text{var}(S)=\frac{1}{18}\left(n(n-1)(2n+5)-\sum_{p=1}^{g}t_{p}(t_{p}-1)(2t_{p}+5)\right)\] where:
\[g = \text{number of tied groups}\] \[t_{p} = \text{number of observations in the }p^{th}\text{ group for a given tie}\]
For example, if 3 observations in our example were tied at the value 35, then there would be \(g = 1\) tied group with \(t_{1} = 3\) observations.
Note: according to this source, a minimum of 4 observations should be used for this test and it is recommended to have 8 or more.
Putting this all together, the test statistic \(Z_{MK}\) that we are interested in (to determine whether the score is meaningfully larger or smaller than we would expect from data with no trend) is calculated as follows:
if \(S > 0\) then \(Z_{MK} = \frac{S-1}{\sqrt{\text{var}(S)}}\)
if \(S < 0\) then \(Z_{MK} = \frac{S+1}{\sqrt{\text{var}(S)}}\)
if \(S = 0\) then \(Z_{MK} = 0\)
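The no-ties variance formula and the three cases above can be put together in a few lines of base R (a sketch; mk_z is an illustrative helper name, and it assumes there are no tied values):

```r
# Z_MK from an S score and n observations, assuming no tied values
mk_z <- function(s, n) {
  var_s <- n * (n - 1) * (2 * n + 5) / 18
  if (s > 0) {
    (s - 1) / sqrt(var_s)
  } else if (s < 0) {
    (s + 1) / sqrt(var_s)
  } else {
    0
  }
}

mk_z(-7, 6)
# approximately -1.127
```

With \(S = -7\) and \(n = 6\) this reproduces the standardized score derived in the worked example later in this section.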
This test statistic \(Z_{MK}\) is considered a \(Z\) score (also known as a standard or standardized score). We can use standard lookup tables to calculate a p-value from it. We will walk through an example of how to do this below.
Let’s conduct a Mann-Kendall trend test using our data to see if there are any monotonic trends in youth disconnection across time.
Recall that the youth disconnection rates for Native Americans were some of the highest in the first table we examined.
and the plot we made:
Let’s consider just observed values for the Native American subgroup. The first value in time was (\(x_1\) = time point 1) 24.4, then (\(x_2\)) 28.8 at the next time point, then (\(x_3\)) 27.0, then (\(x_4\)) 26.3, then (\(x_5\)) 25.8, and finally (\(x_6\)) 23.9.
If we were to calculate the Mann-Kendall statistic manually we could do the following:
\[ S = sign(x_2−x_1) + sign(x_3-x_1) + sign(x_4-x_1) + sign(x_5-x_1) + sign(x_6-x_1) +\]
\[ sign(x_3-x_2) + sign(x_4-x_2) + sign(x_5-x_2) + sign(x_6-x_2) +\]
\[sign(x_4-x_3) + sign(x_5-x_3) + sign(x_6-x_3) +\]
\[sign(x_5-x_4) + sign(x_6-x_4) +\]
\[sign(x_6-x_5)\]
Thus to calculate \(S\) manually in R we could do like so:
x1 = 24.4
x2 = 28.8
x3 = 27
x4 = 26.3
x5 = 25.8
x6 = 23.9
sign(x2-x1) + sign(x3-x1) + sign(x4-x1) + sign(x5-x1) + sign(x6-x1) +
sign(x3-x2) + sign(x4-x2) + sign(x5-x2) + sign(x6-x2) +
sign(x4-x3) + sign(x5-x3) + sign(x6-x3) +
sign(x5-x4) + sign(x6-x4) +
sign(x6-x5)
[1] -7
The result is -7.
The \(\text{var}(S)\) would be calculated like so (since we have \(n\) = 6 observations and no ties):
\[ \text{var}(S)=\frac{1}{18}\left(6(6-1)(2(6)+5)-\sum_{p=1}^{g}t_{p}(t_{p}-1)(2t_{p}+5)\right)\] \[ \text{var}(S)=\frac{1}{18}(6)(5)(17)-0 = \frac{510}{18} = 28.33333\]
Finally, we calculate the standardized score \(Z_{MK}\) (with \(S\) < 0):
\(Z_{MK} = \frac{S+1}{\sqrt{\text{var}(S)}}\) \(Z_{MK} = \frac{-7+1}{\sqrt{28.33333}} = \frac{-6}{5.322906} = -1.127\)
As we noted above, \(Z_{MK}\) is the specific Mann-Kendall version of the \(Z\) score. This allows us to take a variety of data (which may or may not be normally distributed) and create a standardized statistic that is approximately normally distributed.
Using the \(Z_{MK}\) Z score, we can use a standard \(Z\) table to determine the \(p\)-value.
The \(p\)-value tells us the probability of observing a test statistic as or more extreme than the one we observed if the null hypothesis (that there is no trend) is true. If the \(p\)-value is small, say less than 0.05, then our observed \(Z_{MK}\) statistic is pretty rare if the null hypothesis is true. This would lead us to reject the null hypothesis and conclude that there is indeed a trend in the data. We will talk more about \(p\)-values, and in particular our choice of 0.05 as a comparison for the \(p\)-value, a little later. See this case study for more information about probability and hypothesis testing.
OK back to our example. Let’s figure out what our \(p\)-value is using the standard \(Z\) score table.
Using the table below, in the \(Z\) column on the far left, we find the row for -1.1; because our \(Z\) value is -1.127, we round it to -1.13 and find the column for 0.03 to account for the second digit after the decimal place. The value where this row and column intersect is 0.1292. This is the \(p\)-value, which indicates the probability of having a \(Z\) score more extreme (on the negative side) of the \(Z\) score normal distribution, or the colored portion of the distribution in the figure below. In other words, there is about a 13% chance of observing data with a more negative \(Z\) score (and thus a downward trend, as the sign of \(Z_{MK}\) is determined by the \(S\) score) simply by chance alone.
Since we do not know in advance what direction of trend we are looking for (thus we are using what is called a two-sided hypothesis, with the alternative hypothesis \(H_{a}\) being that either a positive or negative trend exists), we can use this table to determine that the \(p\)-value is ~0.1292 for a one-sided hypothesis (specifically testing that a negative trend exists, or that a positive trend exists) and multiply this by 2 to get a \(p\)-value of approximately 0.26 for a two-sided hypothesis.
See this article for more information about one-sided and two-sided tests.
Thus, in other words, there is a 26% chance of observing a more positive or more negative \(Z\) score by chance alone. This is a pretty high chance!
To visualize this, our two-sided test looks at the areas of the \(Z\) score probability distribution that are more extreme in both the positive and negative directions:
Alternatively, we can use this calculator, which gives us a more precise \(p\)-value of 0.26 for the two-sided hypothesis.
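In R itself, the pnorm() function from the stats package gives the same tail probabilities as the table or the calculator:

```r
z <- -1.127

# one-sided p-value: probability of a Z score below -1.127
pnorm(z)
# about 0.13

# two-sided p-value: double the tail probability
2 * pnorm(-abs(z))
# about 0.26
```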
Either way, our \(p\)-value of ~0.26 suggests that we do not have a monotonic trend, as it is much larger than the typical significance threshold of 0.05. The idea behind that threshold is that we deem it acceptable to take a 5% risk that our observation occurred simply due to chance.
In our case there is not enough evidence to reject the null hypothesis. Thus we conclude that there is likely no trend.
We can also accomplish this with the MannKendall() function of the Kendall package. This function requires a vector of data for which a trend may be observed. See the documentation for this function of the Kendall package for more details.
In the following we use the:

- pull() function of the dplyr package to get just the vector of disconnection percentages
- MannKendall() function from the Kendall package to perform the test
- summary() function to get more information about the results of our Mann-Kendall trend test

major_groups_long %>%
filter(Gender == "All", Race_Ethnicity == "Native American") %>%
pull(Percent) %>%
MannKendall(.) %>%
summary()
Score = -7 , Var(Score) = 28.33333
denominator = 15
tau = -0.467, 2-sided pvalue =0.25966
Again, just like when we calculated this manually, we see that the score \(S\) is -7. The larger the magnitude of the score, the stronger the indication of an increasing or decreasing monotonic relationship. We also again see that the \(p\)-value is roughly 0.26 and is greater than 0.05, suggesting that we failed to find a monotonic trend in the data.
However, it’s important to note that we only have 6 observations. Thus, we may not have enough observations to observe a trend.
We can also explore whether there is a linear trend using simple linear regression by using the lm() function of the stats package.
First, we visualize the data with Year on the x-axis and Percent of youth disconnection on the y-axis.
major_groups_long %>%
filter(Gender == "All", Race_Ethnicity == "Native American") %>%
ggplot(aes(x = Year, y = Percent)) +
geom_point()
We are interested in modeling how the percent disconnection changes with time, thus we will use the formula Percent ~ Year in the lm() function, which uses the Percent and Year variables. See this case study for more information about linear regression.
major_groups_long %>%
filter(Gender == "All",
Race_Ethnicity == "Native American") %>%
lm(Percent ~ Year, data = .) %>%
summary()
Call:
lm(formula = Percent ~ Year, data = .)
Residuals:
1 2 3 4 5 6
-2.4332 2.2978 0.8288 0.4597 0.2907 -1.4438
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 359.1159 487.3956 0.737 0.502
Year -0.1655 0.2421 -0.683 0.532
Residual standard error: 1.889 on 4 degrees of freedom
Multiple R-squared: 0.1045, Adjusted R-squared: -0.1193
F-statistic: 0.467 on 1 and 4 DF, p-value: 0.5319
Based on the coefficients in the output, for each one-year increase, the estimated mean change in the disconnection rate is -0.1655; in other words, the rate is estimated to decrease slightly each year.
We can also see from the \(p\)-value (Pr(>|t|)), that this relationship is not statistically significant, as it is much larger than 0.05. Again, we are however largely limited by the number of observations in this dataset.
We can visualize the relationship above like so using the geom_smooth() function of the ggplot2 package with the method = "lm" argument to show the linear model.
major_groups_long %>%
filter(Gender == "All",
Race_Ethnicity == "Native American") %>%
ggplot(aes(x = Year, y = Percent)) +
geom_smooth(method = "lm", color = "red") +
geom_point() +
scale_x_continuous(breaks = seq(2008, 2018, by = 1),
labels = seq(2008, 2018, by = 1),
limits = c(2008, 2018)) +
theme_minimal() +
labs(title = "Youth Disconnection Rates of Native American Youth",
subtitle = "2008 - 2017",
x = "Year",
y = "Disconnection Rate")
As we can see, although the line appears to have a negative slope, there is a large amount of uncertainty around the fitted line.
What if we look at data where, based on our data visualizations, we think we might see a trend? The female Latinx data looks as though it may be decreasing:
major_groups_long %>%
filter(Gender == "Female", Race_Ethnicity == "Latinx") %>%
pull(Percent) %>%
MannKendall(.) %>%
summary()
Score = -13 , Var(Score) = 28.33333
denominator = 15
tau = -0.867, 2-sided pvalue =0.024171
Indeed, this data does show a trend of declining rates of youth disconnection: the score is again negative and larger in magnitude, and the \(p\)-value is now much less than 0.05 even for the two-sided hypothesis.
The Z score can be calculated like so:
\(Z_{MK} = \frac{-13+1}{\sqrt{28.33333}} = \frac{-12}{5.322906} = -2.254408\)
The one-tailed p-value is half of the two-sided p-value and is thus roughly 0.012. We just tested two hypotheses, one for the Native American population and one for the Latinx female population. So we should use the Bonferroni multiple testing correction to adjust our significance threshold \(\alpha\) to be \(.05/(n\text{ tests}) = .05/2 = .025\). In this method we divide our significance threshold alpha (generally 0.05) by the number of tests. Why do we need to do this?
Alpha is the amount of risk that we are willing to accept that a statistical test result is actually a false positive; in other words, that we reject the null hypothesis when the null hypothesis is true. This is also called type 1 error. A significance threshold of 0.05 is customary, and it means that we are accepting a 5% chance that our result is actually a false positive. When we perform multiple tests, the probability of not making a type 1 error on any single test stays the same, but the overall probability of making at least one error across the tests increases. We can calculate the probability of making no errors at all by multiplying together the probabilities for each test (which stay the same for each individual test). See here about why we multiply probabilities together. The more tests we do, the larger the chance that we get a significant result in error. See this case study for more details about multiple testing correction.
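Equivalently, rather than shrinking the threshold, we can inflate the p-values with the p.adjust() function of the stats package and keep comparing against 0.05; the two views give the same decisions. Here the inputs are the approximate one-sided p-values from our two tests:

```r
# approximate one-sided p-values: Native American test, then Latinx female test
p <- c(native_american = 0.13, latinx_female = 0.012)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
p.adjust(p, method = "bonferroni")
# native_american   latinx_female
#            0.26           0.024
```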
Our \(p\)-value for the one-sided test was 0.012, which is less than our adjusted significance threshold of .025 for multiple testing. This suggests that there is indeed a decreasing trend in youth disconnection among the female Latinx population.
Now we will create a plot that summarizes our findings.
We will create a plot that is just an arrow using geom_segment().
This requires the following arguments:
- x - the x coordinate of the starting point of the arrow
- y - the y coordinate of the starting point of the arrow
- xend - the x coordinate of the end point of the arrow
- yend - the y coordinate of the end point of the arrow
Also, ggplot requires a data input; we will use this to create the y-axis value for the arrow. We will use the aes() function to define the length and location of the arrow.
We need to use the arrow argument with the arrow() function to create an arrow. The lineend and linejoin arguments specify what styles to use. The size argument specifies how thick the arrow is displayed.
See here for different arrow style options.
Then we will use the xlim() function of the ggplot2 package to specify the overall size of the plot relative to the arrow.
arrow_df <- data.frame(y = 1)
arrow_df
  y
1 1
arrowplot <- ggplot(arrow_df, aes(x = .5, y = 1, xend = 1, yend = 1)) +
geom_segment(arrow = arrow(),
lineend = "butt",
linejoin = "mitre",
size = 3) +
xlim(0.25, 1.25) + ylim(0, 1.5) +
geom_text(label = "Look at\n subgroups", aes(x = .75, y = .8),
size = 5, fontface = "bold")
arrowplot
Then we will use theme_void() to remove the background.
We will use the patchwork package to do so.
This package allows us to create formulas for our plot layouts. We can use + to add plots and plot elements like plot_spacer() to add an empty space, and we can use / to place plots in different rows on top of one another, like a fraction.
Thus we can combine our plots that we previously named and saved as objects.
The plot_layout() function allows for specification of the relative heights and widths of the various plot elements. Since we will create three columns (distinguished by the + symbols), we need to specify widths for all three.
We can change aspects about the plots using the + symbol and the theme() function as we have previously used it. We will also add an overall title using the plot_annotation() function.
Finally, we will use the png() function of the grDevices package (which is automatically loaded in R sessions) to save the plot. We will use the here() function of the here package to specify that we want to save it in the img directory and call it mainplot.png. We can also specify the resolution with res; in doing so, we need to provide width and height specifications so the image is large enough.
```r
plot_by_gender <- plot_by_gender + theme(axis.text = element_text(size = 12))
asian_subgroup_plot_simp <- asian_subgroup_plot_simp + theme(axis.text = element_text(size = 12))

png(filename = here::here("img", "mainplot.png"),
    res = 300, width = 15, height = 10, units = "in")
youth_plot +
  theme(axis.text = element_text(size = 14, angle = 90)) +
  arrowplot +
  (plot_by_gender /
     asian_subgroup_plot_simp) +
  plot_layout(widths = c(4.6, 1, 4.65)) &
  plot_annotation(title = "Deeper Inspection Reveals Unexpected Differences for Youth Disconnection Among\nGender and Racial/Ethnic Subgroups",
                  theme = theme(plot.title = element_text(size = 22,
                                                          face = "bold")))
dev.off()
```
In this case study we evaluated rates of youth disconnection in the United States between 2008 and 2018, where youth disconnection was defined as individuals between the ages of 16 and 24 who are neither working nor in school. In particular, we focused on evaluating specific subgroups to identify populations that may experience particularly high levels of disconnection. We used summary-level data from two Measure of America reports:
The data in these reports come from the American Community Survey (ACS).
We demonstrated how to import data from these PDFs by extracting the text from screenshot images of the tables we were particularly interested in.
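That import workflow can be sketched roughly as follows (the file name table_screenshot.png is a hypothetical placeholder, not a file from the case study):

```r
library(magick)

# read a screenshot of a table and extract its text with OCR
# (image_ocr() uses the tesseract engine under the hood)
img <- image_read(here::here("img", "table_screenshot.png"))
text <- image_ocr(img)
cat(text)
```

The resulting text then needs string wrangling (e.g., with stringr) before it becomes a usable tibble.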
The 2019 Measure of America annual report identified that Native American populations experience particularly high rates of youth disconnection. Indeed, roughly a quarter of this population appears to experience youth disconnection, which is much higher than for the other major racial and ethnic groups. By performing the Mann-Kendall trend test (https://www.statisticshowto.com/mann-kendall-trend-test/) both manually and using the MannKendall() function of the Kendall package, we demonstrated that this rate does not appear to be changing over time. This suggests that prevention and re-engagement strategies should especially focus on this population.
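The manual version of that test can be sketched in base R as follows; the rates vector here is made-up illustrative data, not the case-study values:

```r
# Hypothetical disconnection rates over six years (illustrative only)
rates <- c(24.5, 25.1, 24.8, 25.0, 24.7, 24.9)
n <- length(rates)

# Mann-Kendall S statistic: sum of the signs of all pairwise differences
# (later value minus earlier value)
S <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    S <- S + sign(rates[j] - rates[i])
  }
}

# Variance of S under the null hypothesis of no trend (assuming no ties)
varS <- n * (n - 1) * (2 * n + 5) / 18

# Z score with continuity correction, then a two-sided p-value
Z <- if (S > 0) (S - 1) / sqrt(varS) else if (S < 0) (S + 1) / sqrt(varS) else 0
p_value <- 2 * pnorm(-abs(Z))
```

The result can be checked against Kendall::MannKendall(rates); a p-value near 1, as here, is consistent with no monotonic trend.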
By plotting subgroups within our data, it is clear that there are differences among particular populations. We see much higher levels of disconnection among Black male youth than among Black female youth.
We also see that youth disconnection rates among Hmong and Cambodian populations are very high, despite rates being quite low for the Asian population in general.
Overall, this case study demonstrates the importance of examining populations at deeper levels, an issue of great importance in public health. Where possible, data should be collected on a variety of demographic groups to better characterize vulnerable populations.
magick package (for example, perhaps the data about different states over time in the 2019 report called Making the Connection). Look for differences between groups by plotting the data and evaluating with the Mann-Kendall test.

RStudio
Cheatsheet on the RStudio IDE
Other RStudio cheatsheets
Tidyverse
Response bias
Cross-sectional data
Population
Sample
Sampling methods
Inference
American Community Survey (ACS)
See here for more detailed information about the survey
Measure of America
Social Science Research Council
Piping in R
Writing functions
Also see this case study for more information on writing functions.
String manipulation cheatsheet
Table formats
Regression
simple linear regression
monotonic association
Kendall rank correlation coefficient
Null hypothesis
Alternative hypothesis
Probability
one-sided and two-sided hypotheses
Nonparametric
Parametric
Significance threshold
Z score
Z score table
Z score to p-value calculator
ggplot2 package
Please see this case study for more details on using ggplot2
grammar of graphics
ggplot2 themes
directlabels package methods
Hmong people
Intersections
Motivating article for this case study about youth disconnection/opportunity youth
To learn more about importing and wrangling PDFs using the pdftools package see this case study and this case study.
To learn more about what you can do with the magick package see this vignette.
To learn more about the Mann-Kendall trend test see here and here.
To learn more about hypothesis testing, see this case study.
Packages used in this case study:
| Package | Use in this case study |
|---|---|
| here | to easily load and save data |
| pdftools | to import PDF documents |
| magick | for importing images and extracting text from images |
| tesseract | for extracting text from images with magick |
| knitr | for showing images in reports |
| dplyr | to filter, subset, join, add rows to, and modify the data |
| stringr | to manipulate strings |
| magrittr | to pipe sequential commands |
| tidyr | to change the shape or format of tibbles to wide and long, to drop rows with NA values, to separate a column into additional columns, and to fill out values based on previous values |
| tibble | to create tibbles |
| ggplot2 | to create plots |
| directlabels | to add labels directly to lines in plots |
| cowplot | to add images to plots |
| forcats | to reorder factor for plot |
| Kendall | to implement the Mann-Kendall trend test in R |
| patchwork | to combine plots |
Want to learn more about how to prevent and mitigate youth disconnection?
Or do you know youths who are disconnected?
See the program directory at youth.gov and this program listing focused on Maryland but including other locations for listings of programs dedicated to re-engaging disconnected youth or preventing disconnection.
Also, see The Center for Communities That Care and the PROSPER program for particular examples.
We would like to acknowledge Tamar Mendelson for assisting in framing the major direction of the case study.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.